Extraction of Social Context and Application to Personal Multimedia Exploration Brett Adams, Dinh Phung, Svetha Venkatesh Department of Computing Curtin University of Technology GPO Box U1987, Perth, 6845, W. Australia {adamsb,phungquo,svetha}@cs.curtin.edu.au
ABSTRACT Personal media collections are often viewed and managed along the social dimension, the places we spend time at and the people we see, thus tools for extracting and using this information are required. We present novel algorithms for identifying socially significant places termed social spheres unobtrusively from GPS traces of daily life, and label them as one of Home, Work, or Other, with quantitative evaluation of 9 months taken from 5 users. We extract locational co-presence of these users and formulate a novel measure of social tie strength based on frequency of interaction, and the nature of spheres it occurs within. Comparative user studies of a multimedia browser designed to demonstrate the utility of social metadata indicate the usefulness of a simple interface allowing navigation and filtering in these terms. We note the application of social context is potentially much broader than personal media management, including context-aware device behaviour, life logs, social networks, and location-aware information services.
Categories and Subject Descriptors H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems
General Terms Algorithms, Human Factors, Experimentation
Keywords Multimedia browsing, social context
1. INTRODUCTION
Personal captured media, such as photos and videos, have become increasingly easy to capture, but hard to use, author, and share. We can ‘save’ our memories, but they are difficult to reminisce over and communicate. Our capacity
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM’06, October 23–27, 2006, Santa Barbara, California, USA. Copyright 2006 ACM 1-59593-447-2/06/0010 ...$5.00.
to save continues to grow and with it a sense of urgency to our managing it effectively [33]. Recognition of this has seen the entrenchment of research in ACM communities aimed at needful technologies, including content-derived semantics [24], automatic summarization/composition/transformation [14, 18, 19], rendering of time-based media [7], multi-modal rendering [6], interpretation of the capture act [26], and even a blurring of the distinction between authoring and browsing [4] (non-acm [30]). Taken together, they all address facets of the problem of identifying and leveraging context, a problem addressed by the CHI community from the perspective of decreasing ‘digital burden’, e.g. [16]. The search for context is driven by the recognition that extrinsic properties are often just as significant, if not more so, than intrinsic. For example, a poorly-lit photo of a child might be excluded from a query return set based on visual quality metrics alone, but be included in a system cognizant of the relative rarity of the subject or their relational significance to the searcher. Personal media is thus arguably often socially situated, for an individual and for communities of various makeup (e.g. family, friends). Queries about which media, to whom, and how, are all inherently parameterized by the aspects of social context referred to above; It is a social search problem. This work seeks to formulate and extract social context from location data obtained from GPS signals, and to a lesser extent from persistent audio. We cast this problem in a clustering setting and unsupervisedly discover a user’s traces, formally named stays and places, where a stay is defined as a time-stamped location where the user has spent a reasonable amount of time (e.g., bookshop-at-3pm) and a place, or social sphere, is defined as a time-independent location where several stays have been experienced (e.g. several stays at the bookshop make it a familiar place for the user). 
We concentrate on four aspects of social context: a) extraction of significant locational context for a user, wherein we formulate methods to extract places (social spheres) and stays, and label social spheres as Home, Work or Other; b) recovery of shared social spheres across multiple users; c) social ties, indicative of the strength of the bond between a pair of users; and d) construction of a multimedia explorer that integrates and uses social metadata for media filtering. GPS data is particularly challenging to abstract: data goes missing because devices can be turned on or off arbitrarily owing to battery failure or a user’s preferences, the signal is intrinsically noisy, and it is often lost, particularly indoors. Addressing these issues, we present a novel solution based on DBSCAN, a density-based clustering approach, to enable us to deal with large and sparse datasets. We consider a multiple-user approach in which a set of social spheres is extracted for each user alone, enabling us to infer locational co-presence from shared social spheres. Next, we present a novel formulation for social tie strength, based on the frequency of time spent together and the nature of the social sphere(s) it occurs within. We choose persistent audio for our experiments to infer the frequency of co-presence because it has a finer resolution than locational co-presence inferred from GPS. We then compute social tie strength between the user and their acquaintances. Personal media can then be viewed as being created within this web of social context, and a novel multimedia exploration environment is presented to demonstrate the utility of this social metadata. Specifically, it is a multi-user spatiotemporal browser with the ability to render images, video, and movies in a unified environment, and to filter media items on time, position, labelled significant location, shared location, presence of actor, and social tie strength. We present experimental results on a set of 5 users who collected GPS data and media in their daily lives over a 9-month period; persistent audio was collected for 1 user for 1 month. This data set is used to evaluate the algorithmic performance for extracting stays, social spheres, locational co-presence and social ties across these users. Our social-metadata-aware browser is then presented to the users for media filtering and browsing. To provide further judgement on the proposed framework, we present positive feedback from a user evaluation study. In brief, the novelty of this work lies in a) a framework for multimedia exploration embedding and using social metadata; b) algorithms to extract and label fundamental categories for stays, social spheres, and their shared positions from fragmented and noisy GPS data; and c) a formulation of social ties and their strength.
The significance of this paper can be summarized as follows.
• Media are modeled in a manner appropriate to the less information-centric domain of personal media consumption – e.g., relevance can be calculated a priori from social ties as well as learned from user interaction with media over time.
• Social attributes of people and places better align with user recall¹, and support queries of the form “Show me photos or video from the park near home.”
• The scope of application of social context is broader than multimedia management. Proactive device behaviour, integration with existing online social network technologies², personal life logs [13], as well as fusion with GIS and search technologies³, would all benefit from such a core representation of social beings.
The rest of the paper is organized as follows. Related work is discussed in section 2. Social context is formulated and presented in section 3. The multimedia explorer is presented in section 4, along with the user evaluation. Finally, section 5 concludes the paper with perspectives on future work.
¹ [11] note a breakdown of contextual information used to annotate media – people, events, relationship type (fiancee), activities, etc. – which we note is heavily socially situated, and ‘often relating strongly to autobiographical experience.’
² E.g. www.friendster.com
³ ‘Mashups’ with information extracted from Google are a popular example.
2. RELATED WORK
Recent popularity of GPS-based devices has boosted research interest in location-based applications. Central to these applications is the need to find significant locations, usually assumed to be places where a user spends a considerable amount of time. Early work in this field uses simple heuristics for this task. Marmasse and Schmandt [27] infer time-stamped locations (called ‘stays’ in our paper) by simply checking for regions within a perimeter in which the GPS signal disappears and reappears. This is further improved in the work of [3] by setting time thresholds for the disappearing period. Clusters are built on top of stays using the k-means algorithm to find main locations. User patterns of movement are then learned using Markov models based on signatures of discovered locations. GPS signals are temporal in nature, and the popular k-means algorithm has many problems when applied to this domain, such as sensitivity to noise and the need to predetermine the number of clusters. Kang et al. [21] attempt to overcome this problem with time-based clustering, which is efficient and can handle online addition of clusters over time. Their idea is to process data points along the timeline, compile the duration for each cluster, and use a threshold to eliminate false clusters. However, this approach does not solve the problem of signal loss inherent to GPS devices. In another attempt to improve on k-means, Zhou et al. [35] extend the concept of ‘density’ in the DBSCAN algorithm [25] to ‘density-joinable’. This is further improved in [36] to deal with the time issue. The idea is to use an extra time parameter T to refine the neighbouring set of points and define a ‘joinable’ criterion to merge two clusters. Their work aims to deal directly with sparse GPS readings from mobile devices, and they report a positive result on location finding with an R-TDJ algorithm.
They introduce a time relaxation scheme to solve the problem of repetitive short visits for which each visit alone does not accumulate enough points to form a cluster in DBSCAN. In our work, we can effectively solve this problem by not processing data points in a timeline but considering the GPS data for the whole day together, clustering to find the places first and then using time information to expand a place into stays. This avoids the ad-hoc determination of time relaxation in the R-TDJ algorithm. Work aimed at extracting social context is as diverse as the definitions given to the term. One cluster aims to extract some form of socially significant context from multi-sensors more or less unsupervisedly. [22] automatically determine a user’s interruptibility as a function of personal and social situation. They classify activity (sitting, walking, running etc.) from accelerometer logs, and audio scene (e.g. restaurant, conversation, street etc.). [10] present a vision for continuous personal archives augmented by context obtained from opportunistic data sources, such as GPS, email, photos, calendars etc. They segment, cluster and classify audio scenes (e.g. meeting, lecture, street), and present an example visualization fusing that information with email logs and hand-entered calendar appointments. The information is extracted chiefly with a view to its utility in aiding recall for a continuous life log. These approaches focus on classifying the user’s physical situation in isolation, and environment.
The notion of a social network is salient in another cluster of work aimed at supporting distributed collaboration. These approaches mine various information-rich sources, such as email, disk activity, and online calendars, in order to classify and represent a worker’s network of contacts. For example, [12] mine email logs automatically to extract social networks, which are then open to traditional network analyses (e.g. clique formation, coherence, spanning, etc.). [5] extract various work-day rhythms in order to optimize communication. These approaches are relationship focussed. Thus there is still a need for multi-scale characterization of socially salient features that includes explicit modelling of the interaction of the social phenomena of places and people. [20] provide a conceptual framework for designing location-aware applications, and note “such systems must integrate information about places with data about users personal routines and social relationships.” We take this as motivation for the development of a novel, social-context-aware media browser. Two aspects of personal media management systems are relevant to this work: dealing with the volume of media assets – photos and videos chief among them – and sharing them effectively. Volume creates the interrelated problems of retrieving media, representing it and interacting with it. In the absence of manual annotation, automatic indexing techniques have been borrowed from the general field of semantic image and video classification [17]. Other indices recognized as significant for personal media are implicit in the import or storage of the media itself, such as digital camera upload ‘rolls’, folders, and timestamps [14], which are often embedded by capture devices by default. GPS-derived location is also used to provide a spatial index [34]. [26] explicitly use the assumed context of home video to index video based on the user’s intention in capturing footage.
Manual annotation of keywords or people is assumed or supported in many cases, particularly in commercial systems such as PhotoMesa or Picasa. However, manual annotation is often not performed, due to the sheer volume of items, and the user’s paucity of time or willingness [30]. When such indices are present, the powerful search and interaction mechanisms they enable can be overwhelming for users whose ability to create media is not matched by a similar level of ‘computer savvy’ [23]. There is a significant subset of personal media browsing that is better framed as entertainment-oriented rather than goal-directed and information-centric [33]. Cutting across these issues is that of sharing media with others. Sharing based on keyword tags (www.flickr.com) or manual referral (www.glidedigital.com) are common methods. [2] target the context of a group of friends browsing a shared image collection. Images annotated with keywords and users present are used to drive a number of interesting event visualizations. They introduce the concepts of event conditioning (on user presence), event support, event interest based on semantic distance of metadata, event cones, and viewpoint evolution. These concepts are used to visualize the spatio-temporal evolution of a single user’s events, event history culminating in two friends meeting at a shared event, as well as slideshows of event(s) from a particular user’s viewpoint. The MMM system of [9] allows cell phone client software to upload images together with annotations to a server. Shared metadata can then be spread to unannotated photos via similarity in location and image features. These approaches require either explicit action by the media owner, or else willingness at some point to invest in annotating some or all of the items in their collection. In summary, we leverage the location and audio sensing abilities of a device already carried to unobtrusively extract social context in terms meaningful to the user. This is then used to provide browsing functionality peculiar to personal media with zero annotation effort required by the user.
3. EXTRACTION OF SOCIAL CONTEXT
In this section we define social spheres and ties, and describe how we formulate and extract them, starting with a brief discussion of data collection and pre-processing.
3.1 Data and Pre-processing
We experiment on a set of N = 5 users⁴, whose social interactions can be categorized into three groups: Family = {James, Linda, MumJ}, Workmate = {James, Neil}, and Friend = {James, Josh}⁵. Each user carries a GPS device; the main user, James, is also equipped with an audio recording device. A total of approximately nine months’ worth of GPS data was collected, of which James accounts for five months, and the others for one month each. Audio data was collected for one month.
A GPS device is the natural choice for most existing location-based applications since it provides crucial spatio-temporal information about the users. However, signals collected in a real-world scenario, such as in this paper, are extremely noisy and fragmented. There are three main reasons for this: the device may be turned on and off arbitrarily due to battery failure or the user’s preferences; the GPS signal is noisy (typical accuracy is within ±15 meters, but can be worse due to signal scatter); and the signal can be lost at any time (e.g., failure to sight satellites, or inside a building). These issues make our problem more challenging than most previously reported work. For example, data for Neil is very fragmented: the GPS is not logged continuously between days, and for some days only a few hours of data are available.
We use three pre-processing procedures to address these problems. Given the GPS data for one day $V_i$, we use an approach similar to [3] to search for regions where the GPS signal disappears and reappears in the same place after a gap exceeding a certain duration threshold. We then interpolate the missing GPS signal by filling in random points around that place. We shall refer to this operator as $I_1(V_i)$. To deal with data from users who often lose signal, or turn their device off, when at home (an important social category in our work), we sometimes interpolate $V_i$ using the previous day’s GPS data $V_{i-1}$ and the GPS data of the following day $V_{i+1}$, if it exists. For example, if the last GPS data point of $V_{i-1}$ is recorded in the same locale as the first data point in $V_i$, we interpolate the data between days. We shall refer to this operation as $I_2(V_i)$. If both $I_1$ and $I_2$ are applied, we simply refer to it as $I$. Figure 1 shows an example of such interpolations applied within a day ($I_1$), and with the following day ($I_2$).
⁴ Participant names have been altered for anonymity.
⁵ In real life James and Linda are a couple, MumJ is the mother of James, Neil is a co-worker of James, and Josh is a friend of James.
Figure 1: Example of interpolation applied to James’s GPS data for 11/12/2005.
GPS readings also come with velocity measures. We can make use of this information to speed up and improve the clustering process. A GPS reading recorded while the user is moving at high speed clearly does not indicate a place of interest; we therefore remove these points from the dataset before clustering. The speed threshold is set automatically to the mean velocity obtained from one day’s GPS data $V$. We refer to this operation as $R(V)$.
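The within-day interpolation and the speed filter can be illustrated with a minimal sketch. The tuple layout `(t_seconds, lat, lon, speed)`, the function names, and the defaults `min_gap`, `radius`, and `step` are assumptions made for illustration only; the paper does not specify them.

```python
import math
import random
from statistics import mean

# Each reading is a tuple (t_seconds, lat, lon, speed); this layout is an assumption.

def interpolate_gaps(readings, min_gap=300, radius=0.001, step=60, seed=0):
    """I1-style fill: where the signal vanishes and reappears at the same place
    after more than `min_gap` seconds, insert random points around that place."""
    rng = random.Random(seed)
    filled = []
    for cur, nxt in zip(readings, readings[1:]):
        filled.append(cur)
        long_gap = nxt[0] - cur[0] > min_gap
        same_place = math.dist(cur[1:3], nxt[1:3]) <= radius
        if long_gap and same_place:
            t = cur[0] + step
            while t < nxt[0]:
                lat = cur[1] + rng.uniform(-radius, radius)
                lon = cur[2] + rng.uniform(-radius, radius)
                filled.append((t, lat, lon, 0.0))  # assumed stationary during the gap
                t += step
    if readings:
        filled.append(readings[-1])
    return filled

def remove_fast_points(readings):
    """R(V): drop readings whose speed exceeds the mean speed of the day."""
    if not readings:
        return []
    threshold = mean(r[3] for r in readings)
    return [r for r in readings if r[3] <= threshold]

# Hypothetical day: a ten-minute signal loss at one spot, then a fast-moving reading.
day = [(0, 1.0, 2.0, 0.0), (600, 1.0, 2.0, 0.0), (660, 1.5, 2.5, 20.0)]
filled = interpolate_gaps(day)      # interpolate first ...
kept = remove_fast_points(filled)   # ... then remove fast points, as in R composed with I1
```

Applying the filter after interpolation follows the order in which the two operators are composed in Algorithm 1.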
3.2 Social Sphere or Places
We desire automated discovery of significant places. Using GPS data collected for a particular user, we seek an efficient method to partition the data into clusters that are meaningful to the user, where ‘meaningful’ means that they are socially important to the user. Unlike previous work on GPS-based location discovery (e.g., [15]), where categorical aspects of the places are ignored in the clustering process, our primary interest is in discovering clusters that have social functions. More precisely, we distinguish a social sphere or place l by the fact that l is visited for a considerable amount of time and repeatedly over time. For example, ‘Home’ and ‘Work’ are the most obvious social spheres or places in our definition; going to the same cinema is another example, whereas a one-time stop at a beach is not. Our algorithm thus consists of two parts: a clustering engine that performs place discovery on a daily basis, and a mechanism to add, update, and refine the set of places over time. Given GPS readings, a simple method to cluster the data is the k-means algorithm. While k-means is simple, intuitive and easy to implement, it suffers from three disadvantages: (i) the number of clusters k must be specified in advance, (ii) it favors symmetric shapes (e.g., circles or ellipses), and thus is unable to handle arbitrary shapes, and (iii) it is sensitive to noise and initialization points. The first two problems are the more serious for us: firstly, for any arbitrary date we have very little, if any, knowledge about the places, and secondly, the GPS readings are trajectory data, and thus often do not form symmetric clusters.
The first problem is sometimes overcome in the literature by sequentially running k-means for different k and picking the one that returns the highest average silhouette value, which compares the distance of a point p to other points within its cluster (intra-cluster distance) with its distance to points of other clusters (inter-cluster distance). Unfortunately, computing the average silhouette value for each k-means run is computationally expensive. Our experiment shows that even though the performance of this approach is relatively good, it can take almost three hours to process a single day of data.
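To make the cost of this model-selection step concrete, the sketch below computes silhouette values for one given labelling, using the standard definition s(p) = (b − a) / max(a, b), where a is the mean distance from p to the other points of its own cluster and b is the smallest mean distance from p to the points of any other cluster. The function name and the brute-force pairwise loop are illustrative assumptions, not the implementation used in the experiments.

```python
import math

def silhouette_values(points, labels):
    """Return the silhouette value of every point for a given cluster labelling.
    Brute force: every point is compared with every other point, O(n^2)."""
    values = []
    for i, p in enumerate(points):
        dists = {}  # cluster label -> distances from p to that cluster's points
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.get(labels[i])
        others = [sum(d) / len(d) for lab, d in dists.items() if lab != labels[i]]
        if not own or not others:
            values.append(0.0)  # singleton cluster or single cluster: use 0 by convention
            continue
        a = sum(own) / len(own)   # mean intra-cluster distance
        b = min(others)           # mean distance to the nearest other cluster
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

def average_silhouette(points, labels):
    vals = silhouette_values(points, labels)
    return sum(vals) / len(vals)

# Hypothetical toy data: two well-separated pairs of points.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = average_silhouette(pts, [0, 0, 1, 1])  # labelling that matches the two groups
bad = average_silhouette(pts, [0, 1, 0, 1])   # labelling that splits both groups
```

Because the pairwise distance pass is quadratic in the number of points and must be repeated for every candidate k, the selection loop becomes costly on a full day of GPS readings.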
To circumvent the problems of k-means, we propose the use of dbscan [25]⁶, a density-based clustering algorithm. The main features of dbscan are: (i) it can handle arbitrary cluster shapes, and thus is particularly suitable for large and sparse data sets, (ii) it requires no initialization, and always gives the same results given the same input parameters, and (iii) it is able to exclude noise, outliers and abnormal points. These features make dbscan particularly applicable in our setting. We refer the reader to [25] for a full treatment of dbscan, and only summarize the key ideas. dbscan is founded upon three concepts: directly density reachable ($\rightarrow$), density reachable ($\rightsquigarrow$) and density connected ($\leftrightarrow$). The parameters required for this algorithm are a pair $(\epsilon, D)$, where $\epsilon$ is used to draw a perimeter around a point $p$ to form its neighbouring set $N(p \mid \epsilon)$, and $D$ serves as a threshold to test if two points $p$ and $q$ are directly density reachable. Two points $p$ and $q$ are then called density reachable, $p \rightsquigarrow q$, if there is a sequence of points $\{p_1, \ldots, p_l\}$ linking them in a directly reachable manner, i.e., $p \rightarrow p_1$, $p_i \rightarrow p_{i+1}$ for all $i = 1, \ldots, l-1$, and $p_l \rightarrow q$. Finally, $p$ is called density connected to $q$ if we can find a point $o$ that is density reachable from both $p$ and $q$.

\[ q \xrightarrow{(\epsilon, D)} p \iff q \in N(p \mid \epsilon) \ \text{and} \ |N(p \mid \epsilon)| > D \]
\[ p \rightsquigarrow q \iff \exists\, p_1, \ldots, p_l \ \text{s.t.} \ p \rightarrow p_1 \rightarrow \cdots \rightarrow p_l \rightarrow q \]
\[ p \leftrightarrow q \iff \exists\, o \ \text{s.t.} \ p \rightsquigarrow o \ \text{and} \ q \rightsquigarrow o \]

A cluster $C$ is then defined as a maximal set of points that are pairwise density connected. dbscan searches for clusters by seeding core points (a core point lies inside a cluster and is identified by the number of points in its neighbouring set) and expanding around these points based on the reachability condition. The worst case complexity of computing the neighbouring sets for all points is $O(n^2)$, where $n$ is the number of data points. If an R*-tree representation is used, this complexity reduces to $O(n \log n)$. Thus, dbscan directly poses two problems: (1) it can be slow given its complexity, and (2) we must specify the parameters $(\epsilon, D)$. These problems can be resolved effectively in our setting. Before applying dbscan, the data is filtered of points above a speed threshold as discussed earlier. For our dataset, this stage usually results in a 30–50% reduction in the number of data points on average. Importantly, it also improves clustering accuracy by eliminating noisy (redundant) data points. Regarding the parameters, the dbscan proposed in [25] is also equipped with a heuristics-based procedure to tune the parameters automatically. However, since our primary interest is the dominant places, our choices for $(\epsilon, D)$ can be set by reasoning about the data. We choose $\epsilon = 0.001$, which is roughly equivalent to 60m, double the inherent noise of the GPS device, and $D = 5 \times 60$, which is approximately the number of points generated in 5 minutes, the minimum duration we assume a user must spend at a location before it potentially becomes significant. Recall that we seek to find ‘stays’ and ‘places’ from GPS data, where a stay is a time-dependent unit (e.g., bookshop-1pm-2pm) and a place is where several stays are experienced. Assume our data over N days from a user is given
⁶ dbscan stands for Density-Based Spatial Clustering of Applications with Noise; note that the term ‘density’ here refers to the concept of ‘denseness’, not ‘density’ as in probability density functions.
Algorithm 1 DBSCAN-based clustering of stays and places.
  P = ∅, φ = ∅, m = ∅
  for i = 1 to N do
    V^i = R ∘ I_1([X^i])    {remove points after interpolation}
    (C_i, ω_i) = dbscan(V^i, ε, D)
    S_i = FindStay(V^i, C_i)
    (P, φ, m) = Update(S_i, C_i, ω_i, P, φ, m)
    S_i = UpdateLabel(S_i, P)
  end for
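The dbscan step invoked in Algorithm 1 can be sketched in a few lines. This is a minimal brute-force version written for illustration, not the implementation of [25]: it uses the strict core-point test |N(p | ε)| > D stated in the text, an O(n²) neighbour search with no spatial index, and it returns cluster sizes as the weights ω, which is an assumption about how the weights are defined.

```python
import math
from collections import deque

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster index (or NOISE) and return cluster sizes."""
    n = len(points)
    # Neighbouring set N(p | eps) of every point; each set includes the point itself.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) <= min_pts:   # not a core point
            labels[i] = NOISE
            continue
        labels[i] = cluster                 # seed a new cluster from core point i
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster         # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) > min_pts:
                queue.extend(neighbours[j]) # core point: keep expanding
        cluster += 1
    weights = [labels.count(c) for c in range(cluster)]
    return labels, weights

# Hypothetical toy data: two dense groups and one isolated reading.
group_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.0), (0.0, 0.2)]
group_b = [(10.0 + x, 10.0 + y) for x, y in group_a]
labels, weights = dbscan(group_a + [(5.0, 5.0)] + group_b, eps=1.0, min_pts=3)
```

The number of clusters is never supplied; it falls out of the density parameters, and the isolated reading is reported as noise rather than forced into a cluster.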
Refining and updating discovered places
Recall that we distinguish a social sphere in our work as a place that is socially significant in the user’s mind. Two quantitative components contribute to this distinction: the duration the user spends at that place, and the number of re-visits. It is therefore desirable for the system not only to keep track of and add new places, but also to refine an existing place $P_j$ according to a measure of significance built from these components. We formulate this measure $\mathrm{Sig}[P_j]$ as follows:

\[ \mathrm{Sig}[P_j \mid \mu] = \phi_j \left[ 1 - \mathrm{Geo}(m_j \mid \mu) \right] \]

where the effect of the count $m_j$ is modeled according to a geometric pmf $\mathrm{Geo}(x \mid \mu)$ parametrized by $\mu$. We wish to model its effect in an exponential manner. That is, when any place is first discovered, it is assigned the same degree of significance, and if it is revisited, this value increases in an exponential manner, quickly reflecting the importance of that place. To consistently update and refine existing clusters, we update $\mathrm{Sig}[P_j]$ whenever $P_j$ is revisited. Based on the measures for all places, an exponential distribution is fitted and a hypothesis test at the 95% confidence level is performed to exclude those places that belong to the remaining 5% at the tail of the distribution.
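A small numerical sketch shows how the significance measure behaves as a place is revisited. The parametrization of the geometric pmf as Geo(m | µ) = µ(1 − µ)^(m−1) on m = 1, 2, ... and the value µ = 0.5 are assumptions chosen for illustration; the text does not fix either.

```python
def geometric_pmf(m, mu):
    """Geometric pmf on m = 1, 2, ... (assumed parametrization)."""
    if m < 1 or not 0.0 < mu <= 1.0:
        raise ValueError("require m >= 1 and 0 < mu <= 1")
    return mu * (1.0 - mu) ** (m - 1)

def significance(phi, m, mu=0.5):
    """Sig[P_j | mu] = phi_j * (1 - Geo(m_j | mu))."""
    return phi * (1.0 - geometric_pmf(m, mu))

# A hypothetical place with accumulated weight 100, seen on 1, 2 and 3 stays.
sig_by_visits = [significance(100.0, m) for m in (1, 2, 3)]
```

Under these assumptions the multiplier on the weight is 0.5, 0.75 and 0.875 for one, two and three stays, so the measure rises quickly with the first few re-visits and then saturates towards the accumulated weight.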
3.3 Results: Stays and Places
The GPS dataset described in section 3.1 is used in our experiment. To evaluate performance, we further ask the users to provide their social landmarks as an indication of the groundtruth. Each user is asked to recall and provide the most meaningful places visited during the time the data was collected. Given the volume of data and the imperfect memories of whereabouts, it is to be expected that some places discovered by the algorithm do not have corresponding labels. We therefore do not report precision for this experiment. Figure 2 shows an example of places discovered for Josh and MumJ.
[Figure 2 plots: panels MumJ (upper) and Josh (lower); vertical axis in thousands; horizontal axis lists the discovered place labels.]
in the form $\{X^1, \ldots, X^N\}$, where $X^i$ is the data collected for the i-th day from a user. dbscan$(X, \epsilon, D)$ returns a set of clusters $C$ and a vector of corresponding weights $\omega$. The outline of our stay and place finding algorithm is shown in Algorithm 1. This algorithm computes all accumulated places stored in $\{P, \phi, m\}$, where $P_j$ holds the coordinates of the j-th place, $\phi_j$ is a weight associated with that place, measured as the accumulated number of points belonging to that place over time, and $m_j$ is a count of the number of stays that occur at $P_j$. All of these tasks are included in the routine Update(.). The set of stays discovered for each day is stored in $S_i$ via the routine FindStay(.), which uses time information to segment a cluster (place) into smaller time-stamped chunks (stays) wherever there is a temporal discontinuity in that cluster. Each stay $s_j \in S_i$ is given a unique ID, its start/stop time, and the coordinates of the stay.
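The segmentation performed by FindStay(.) can be sketched as follows. The function name `find_stays`, the representation of a cluster as a sorted list of timestamps in seconds, and the gap threshold `max_gap` are assumptions for illustration; the text only states that a cluster is split where there is a temporal discontinuity.

```python
def find_stays(timestamps, max_gap=300):
    """Split the sorted timestamps of one cluster's points into stays.
    A new stay starts whenever consecutive points are more than `max_gap` apart.
    Returns a list of (start, stop) pairs."""
    stays = []
    start = prev = None
    for t in timestamps:
        if start is None:
            start = prev = t
        elif t - prev > max_gap:
            stays.append((start, prev))  # close the current stay at the last point seen
            start = prev = t
        else:
            prev = t
    if start is not None:
        stays.append((start, prev))
    return stays

# Hypothetical cluster visited twice in one day, an hour apart.
stays = find_stays([0, 60, 120, 3600, 3660])
```

Each returned pair corresponds to one stay at the same place, which is what allows the count of stays per place to be accumulated across days.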
Figure 2: Weighted places discovered for MumJ (upper) and Josh (lower) by the algorithm prior to the refining process. Notice that most ‘Unknown’ labels are at the far right, indicating that they are less meaningful, except the one boxed in the rectangle, which may indicate that MumJ forgot to label it.

The results for discovering places are shown in Table 1. Of most significance are the middle columns, which indicate places discovered that coincide with the groundtruth. All places identified by the users have been found. The right columns indicate the number of Unknown places before and after the refining process. The refining process significantly reduces the number of Unknown places; those that remain are clusters where the user has spent time but which the user has not labelled. Unknown clusters are not indicative of algorithmic error; they possibly correspond to places the user had forgotten to label, or places not deemed important. When users were asked about Unknown places in the first category, they were usually able to identify and label them meaningfully. The latter category is interesting: in one example, when such a cluster was referred back to the user, it was labelled “Traffic jam spot”, which is clearly not a socially significant place.

         Places                      Unknown Places
User     Groundtruth   Discovered    Initial   Refined
James    18            18            30        9
Linda    17            17            4         3
Neil     7             7             2         2
MumJ     9             9             2         0
Josh     8             8             3         0

Table 1: Place discovery performance of our algorithm. Columns to the right report the number of ‘Unknown’ places found before and after refinement.

The results for discovering stays are reported in Table 2, where the second column is the total number of stays discovered, S and S∗ are the numbers of stays corresponding to place groundtruth before and after the refining procedure respectively, and R and R∗ are the corresponding recalls. Linda’s relatively low recall appears to be largely due to missing groundtruth. Her typical day is particularly fragmented in time and broad in space, due to the running of many errands. Hence while these locations qualify as significant, they were overlooked during groundtruthing.

User     # Stays   S     R        S∗    R∗
James    438       350   79.9%    370   84.47%
Linda    64        43    67.19%   44    68.75%
Neil     23        21    91.3%    21    91.3%
MumJ     48        45    93.75%   47    97.92%
Josh     60        52    86.67%   60    100%
Σ        633       511   83.76%   542   88.49%

Table 2: Statistics for finding stays. S and R are stays and recall, respectively. Similarly, S∗ and R∗ are figures following refinement.
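The recall figures in Table 2 can be reproduced from the stay counts with a short calculation. The variable names are illustrative. One observation from the arithmetic, not a statement made in the text: the recalls in the Σ row match the unweighted mean of the five per-user recalls, whereas pooling the counts (511/633 and 542/633) gives somewhat lower values.

```python
# Stay counts taken from Table 2.
total = {"James": 438, "Linda": 64, "Neil": 23, "MumJ": 48, "Josh": 60}
matched = {"James": 350, "Linda": 43, "Neil": 21, "MumJ": 45, "Josh": 52}
refined = {"James": 370, "Linda": 44, "Neil": 21, "MumJ": 47, "Josh": 60}

def recall_percent(hits, n):
    return 100.0 * hits / n

def mean_recall(hits):
    """Unweighted mean of the per-user recalls."""
    per_user = [recall_percent(hits[u], total[u]) for u in total]
    return sum(per_user) / len(per_user)

def pooled_recall(hits):
    """Recall computed over all stays of all users together."""
    return recall_percent(sum(hits.values()), sum(total.values()))

summary = {
    "R mean": round(mean_recall(matched), 2),
    "R pooled": round(pooled_recall(matched), 2),
    "R* mean": round(mean_recall(refined), 2),
    "R* pooled": round(pooled_recall(refined), 2),
}
```

James contributes more than two thirds of all stays, so the pooled figure is dominated by his recall, while the mean weights every user equally.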
3.4 Labeling major social labels
In our experiments discussed so far, the users have provided us with some groundtruth about their social landmarks. Ideally, we would wish to equip the system with a learning method to label places discovered by the clustering engine. We make an initial attempt at this problem in this work, aiming to label two useful social categories, ‘Home’ and ‘Work’; the rest are labelled as ‘Other’. It is reasonable to assume in many cases that the user sleeps at home at night and goes to work during the daytime on weekdays. This assumption allows us to derive quick and simple methods to label those categories. We describe one such method that works well in our case, shown in Algorithm 2. The algorithm attempts to discover the place corresponding to Home or Work by filtering out location data outside the assumed appropriate time ranges, and returns the cluster with maximal duration. In the 5th line, τ(x_i) is the time-constraint condition. To detect ‘Home’ we specify that τ(x_i) is true iff x_i is collected before 7am or after 7pm. To detect ‘Work’ we set the time-constraint periods to 8am-11am and 1pm-4pm, and require that the current day is not on the weekend.

Algorithm 2 Pseudocode to label ‘Home’ and ‘Work’. Depending on the condition τ(.), the returned cluster will be either ‘Home’ or ‘Work’.
1: Randomly pick N days d_1, ..., d_N
2: Interpolate data for each d_i, resulting in data Y^i
3: V = ∅
4: for i = 1 to N do
5:   extract x_i from Y^i s.t. τ(x_i) is true
6:   V = V ∪ x_i
7: end for
8: [C_l, ω_l] = dbscan(V, ε, D)
9: Return l∗ = arg max_l {ω_l}

The randomization in the first line is to account for spontaneous phenomena in which the user may not be at work or home at our designated times. The algorithm can also be run repeatedly to confirm our hypothesis on the ‘Home’ and ‘Work’ labels if required. Figure 3 plots an example of one run for user James.
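The time-constraint condition τ and the final arg max can be sketched directly from the description above. The function names, the reading layout `(timestamp, lat, lon)`, and the treatment of the boundary hours (7pm itself counts as ‘after 7pm’, 11am and 4pm are excluded) are assumptions for illustration. The clustering call on line 8 is omitted here.

```python
from datetime import datetime

def tau_home(ts):
    """True for readings collected before 7am or from 7pm onwards."""
    return ts.hour < 7 or ts.hour >= 19

def tau_work(ts):
    """True for weekday readings collected in 8am-11am or 1pm-4pm."""
    if ts.weekday() >= 5:  # Saturday = 5, Sunday = 6
        return False
    return 8 <= ts.hour < 11 or 13 <= ts.hour < 16

def filter_readings(readings, tau):
    """Keep readings (timestamp, lat, lon) whose timestamp satisfies tau."""
    return [r for r in readings if tau(r[0])]

def heaviest_cluster(weights):
    """Index l* of the cluster with maximal weight (line 9 of Algorithm 2)."""
    return max(range(len(weights)), key=weights.__getitem__)

# Hypothetical readings on Monday 12 December 2005.
readings = [
    (datetime(2005, 12, 12, 6, 30), -31.95, 115.86),
    (datetime(2005, 12, 12, 9, 30), -32.00, 115.89),
    (datetime(2005, 12, 12, 12, 0), -32.01, 115.90),
    (datetime(2005, 12, 12, 23, 0), -31.95, 115.86),
]
home_candidates = filter_readings(readings, tau_home)
work_candidates = filter_readings(readings, tau_work)
```

Swapping the predicate is the only difference between the Home and Work runs, which is why a single algorithm serves both labels.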
In the future, we may need more sophisticated methods to handle varying user profiles, such as having many part-time jobs, a nighttime job, or even no job at all. Such methods would need to use fundamental assumptions about the need for sleep or socialization, and try alternate hypotheses about temporal patterns. Regardless, the Home/Work distinction has near universal validity [28].
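As an illustration, Algorithm 2 might be sketched in Python as follows. This is a minimal sketch under several assumptions that are not taken from the paper: positions are planar coordinates compared by Euclidean distance, the interpolated samples are evenly spaced so that cluster duration ω_l is proportional to the sample count, the third dbscan argument D is read as the usual minimum-points parameter, and the names `label_place`, `is_home_time` and `is_work_time` are hypothetical. The small `dbscan` function is a plain quadratic-time version written for the example, not the implementation used in the experiments.

```python
import random
from datetime import datetime, timedelta
from math import hypot


def is_home_time(t):
    """tau for 'Home': sample collected before 7am or after 7pm."""
    return t.hour < 7 or t.hour >= 19


def is_work_time(t):
    """tau for 'Work': weekday sample within 8am-11am or 1pm-4pm."""
    return t.weekday() < 5 and (8 <= t.hour < 11 or 13 <= t.hour < 16)


def dbscan(points, eps, min_pts):
    """Minimal O(n^2) DBSCAN; returns one cluster id per point (-1 = noise)."""
    labels = [None] * len(points)

    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


def label_place(days, tau, eps, min_pts, n_days, seed=None):
    """days: dict mapping a date to a list of (timestamp, x, y) samples.

    Returns the centroid of the heaviest cluster among the samples that
    satisfy tau, or None if no cluster is found.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(days), min(n_days, len(days)))   # line 1
    v = [(x, y) for d in chosen for (t, x, y) in days[d] if tau(t)]  # lines 4-7
    labels = dbscan(v, eps, min_pts)                             # line 8
    weights = {}
    for lab in labels:
        if lab != -1:
            weights[lab] = weights.get(lab, 0) + 1  # sample count ~ duration
    if not weights:
        return None
    best = max(weights, key=weights.get)                         # line 9
    members = [p for p, lab in zip(v, labels) if lab == best]
    return (sum(x for x, _ in members) / len(members),
            sum(y for _, y in members) / len(members))
```

Calling `label_place` once with `is_home_time` and once with `is_work_time` yields the two candidate places; repeating the call with different seeds corresponds to re-running the algorithm to confirm the labels.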
Figure 3: Visualization of discovering ‘Home’ label for user James.

(Entries in the same position of the two rows of a user pair are matched spheres.)

(James, Linda)
  James: Fr-Dave, Home, Church, GrandPa 1, HeathCote, Shop-Lem, GrandPa 2, Work, Unknown, Shop-SouthL
  Linda: Fr-Dave, Home, Church, GrandPa 1, HeathCote, Shop-Lem, GrandPa 2, Unknown, Shop-Bull, Shop-SouthL

(James, Neil)
  James: Work, Gym, Shop-Kara
  Neil:  Work, Swimming, Unknown

(James, Josh)
  James: Church, GrandPa 1
  Josh:  Church, Home

(James, MumJ)
  James: Home, GrandPa 2, Sister, Unknown, Unknown, Unknown
  MumJ:  Son, Home, Daughter, Doctor, Cafe, Gym

(Linda, Josh)
  Linda: Unknown, Church, GrandPa 1
  Josh:  Student, Church, Home

(Linda, MumJ)
  Linda: Home, GrandPa 2, Lake
  MumJ:  Son, Home, Lake

(Neil, Josh)
  Neil: Unknown, Shop-Sal
  Josh: Work, Shop-DNA

(Neil, MumJ)
  Neil: Shop-Chem
  MumJ: Unknown
Table 3: Discovered shared places. No shared places for (Linda, Neil) and (MumJ, Josh) were found.
3.5
Results: Shared Places
Table 3 shows the shared places discovered algorithmically. Each sub-section corresponds to a user pair. For each user pair, the table lists ordered pairs of matching social spheres, which indicate shared co-presence. For example, James and Linda are co-located at some major social spheres such as Home, Church, and GrandParent. In real life, they are a couple and spend most of their time outside work together at significant places. James and Neil are work colleagues, and share the social sphere Work. Interestingly, James and his Mum share social spheres (Home, Son) when they visit each other, as shown in the first two entries under James and MumJ. The third entry refers to them visiting a common relative, James’s sister. Linda and Josh go to the same Church, and are siblings in real life. Similarly, all other shared relations correspond to socially significant interactions. Little significance can be attached to the Unknown entries, as outlined before. An interesting point to note is that some Unknown places can be labeled because of locational co-presence. For example, James’s Unknown place is the same as Linda’s place labeled Shop-Bull.
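The last observation, that an Unknown place can inherit a label from a co-present user, can be illustrated with a short Python sketch. The function name `propose_labels` and the list-of-pairs input are hypothetical choices made for this example; the paper does not describe an automated procedure for this step.

```python
def propose_labels(user_a, user_b, matched_spheres):
    """Suggest labels for Unknown spheres from a co-present user's labels.

    matched_spheres: list of (label_for_a, label_for_b) pairs, one per
    shared place. Returns a dict mapping a user to the labels proposed
    for that user's Unknown places.
    """
    proposals = {}
    for label_a, label_b in matched_spheres:
        if label_a == "Unknown" and label_b != "Unknown":
            proposals.setdefault(user_a, []).append(label_b)
        elif label_b == "Unknown" and label_a != "Unknown":
            proposals.setdefault(user_b, []).append(label_a)
    return proposals


# Example using matches mentioned in the text for James and Linda.
pairs = [("Home", "Home"), ("Church", "Church"), ("Unknown", "Shop-Bull")]
print(propose_labels("James", "Linda", pairs))
```

A proposal of this kind is only a suggestion: the two users may well name the same place differently, as the (Home, Son) pairing of James and MumJ shows.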
3.6
Social Ties: Formulation
Social context, in addition to place, includes relationships termed social ties. A tie may be characterized by an ordered
pair of actors, to borrow a term from social network theory, its nature (e.g. familial, friends, work-related), the strength of the bond, and shared social spheres. We require an estimate of the user’s interaction with others, and there are many ways this can be estimated, such as detection of presence through audio, co-located GPS, active RFID and so on. Regardless of the technique used to assess co-presence, we formalise social tie with respect to user i as follows: Let user i be observed over a set of S sampled periods (15 minute chunks in our case), and let l_i(s) be the social sphere of this user at sample s. Then let p_i(s, j) denote the Boolean presence of another actor j in sample s, 1 denoting present and 0 not present. To account for the relative importance of location when users interact (e.g. home is typically more socially significant than the dry cleaners), we introduce ω_L as a weight expressing the relative significance of location l_i(s). Then, the social tie strength T between actors i and j is defined as:
\[ T(i, j) = \frac{1}{N_{i,S}} \sum_{s=1}^{S} p_i(s, j)\, \omega_L(l_i(s)) \tag{1} \]

\[ N_{i,S} = \sum_{s=1}^{S} \omega_L(l_i(s)) \tag{2} \]
where N_{i,S} is a normalizing constant for actor i over the sample set. T = 1 is interpreted as ‘actor j is always with i’ for the sample set, and T = 0 as ‘j is never seen with i.’ Note that T is not symmetric: T(i, j) need not equal T(j, i), reflecting that the strength of a bond from one person’s point of view isn’t necessarily shared by the other. Relationships carried on at familiar places imbue those places with a derivative significance, and those places in turn may imbue continuing or new relationships carried on there with significance. To determine location weights, we use a media-flavoured approach: the significance of a place is proportional to how much media is captured there. Let l_m(m, i) be 1 if media item m was created at location i and 0 otherwise, and let M be the total number of media items captured in sample set S at all locations. Then, ω_L is defined as:
\[ \omega_L(i) = \frac{1}{M} \sum_{m=1}^{M} l_m(m, i) \tag{3} \]
Other possibilities for calculating location significance include location type, such as Home or Work, or cumulative time spent there. Whatever the flavour of ω_L, the assumption is that time spent together is a coarse indicator of the significance of the relationship, and that location modulates this.
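To make Equations 1 to 3 concrete, the following Python sketch computes media-flavoured location weights and the resulting tie strength for a handful of samples. The function names, the list-based inputs and the example numbers are assumptions made for illustration; giving a weight of zero to spheres where no media was captured follows directly from Equation 3.

```python
from collections import Counter


def media_weights(media_locations):
    """Eq. (3): omega_L(i) is the fraction of media items captured at i.

    media_locations: one location label per media item (length M).
    """
    total = len(media_locations)
    return {loc: n / total for loc, n in Counter(media_locations).items()}


def tie_strength(spheres, presence, weights):
    """Eqs. (1)-(2): weighted fraction of samples in which j is with i.

    spheres[s]  : social sphere l_i(s) of user i in sample s
    presence[s] : p_i(s, j), 1 if actor j is present in sample s, else 0
    weights     : omega_L as a dict; spheres without media weigh 0
    """
    norm = sum(weights.get(l, 0.0) for l in spheres)          # N_{i,S}
    if norm == 0:
        return 0.0
    weighted = sum(p * weights.get(l, 0.0)
                   for l, p in zip(spheres, presence))
    return weighted / norm


# Made-up example: 10 media items and five 15-minute samples for user i.
omega = media_weights(["Home"] * 6 + ["Work"] * 3 + ["Cafe"])
spheres = ["Home", "Home", "Work", "Work", "Cafe"]
spouse = [1, 1, 0, 0, 0]
coworker = [0, 0, 1, 1, 0]
print(tie_strength(spheres, spouse, omega))    # 1.2 / 1.9
print(tie_strength(spheres, coworker, omega))  # 0.6 / 1.9
```

Both actors are present for two of the five samples, yet the spouse receives the larger value because Home carries more weight than Work. Computing T(j, i) requires actor j’s own sample set, which is why the measure need not be symmetric.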
3.7
Social Ties: Extraction
To extract social ties we experiment with persistent audio records together with GPS logs to obtain a record of when and where the user interacts with known acquaintances. This problem is notoriously hard for the following reasons: a) free-placement of microphone resulting in poor audio captures, b) free format in interaction between people resulting in interruption, no pauses between speakers, talking over each other, and c) noisy ambient environments. Though we believe this problem is still open, we present our preliminary results by casting the problem as supervised speaker identification from noisy ambient audio, and relaxing it to a simpler challenge: speaker presence.
We divide a day’s log into contiguous 15 minute sections and seek a binary classification per speaker as to their presence, with a bias toward high precision and uniform recall. Audio samples of 0.25 seconds are first classified as either speech or not by a decision tree. Speech samples are then passed to a bank of A speaker-vs-other classifiers. In this case, SVMs were found to perform best from among a number of learners. Speaker training was performed with approximately 6 minutes of speech at varying distances from the microphone, and the other class was generated with all but the held-out speaker and some other speakers not part of the set of interest. Samples with more than one positive classification are ignored. A speaker is classed as present in a 15 minute sample if their identified speech is above a minimum threshold of 25s and the detected speech to identified speech proportion is above a threshold of 0.2, both set empirically.

A month’s worth of audio data together with GPS location was recorded using an ultraportable Windows XP device. Actor presence groundtruth was recorded at a resolution of 15 minutes, from 7am until 11pm each day. Figure 4 plots social tie strength, T, for the period for three different flavours of ω calculated using media creations, place label, and duration. As expected, the user’s spouse stands out. Interestingly, T preserves distinctions among workmates. For example, while the user spent about the same amount of time with both co-workers 1 and 2, the media-flavoured weights show more time was spent interacting with co-worker 2 at work, a media creation hotspot, whereas the balance of duration spent with co-worker 1 was at a nearby cafe.

The main difficulty encountered is the false positive rate in speaker identification for this situation, as it has a level of complexity greater than that encountered in classical speaker identification settings.
Our results indicate high accuracy in clean situations (speech detection 95%+, speaker identification 85%+), but low accuracy in noisy or “entangled-speech” environments (speech detection 80%; speaker identification precision 43%, recall 12%, which is above random for 4-class classification but not robust enough for general use). Thus, although we chose speech for its richness in conveying social situations, we now believe other factors such as co-presence derived from co-located GPS should be used, even if the latter has a much lower resolution in being able to infer co-presence. Our formulation of social tie remains the same, but the extraction of co-presence remains a challenging and exciting open problem.
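The presence decision for a single 15 minute section can be sketched in Python as below, taking the per-sample classifier outputs as given. Two readings are assumptions of this sketch rather than statements from the paper: the 0.2 threshold is applied to identified speech divided by detected speech, and samples with more than one positive classification still count as detected speech although they are credited to no speaker. The input encoding and the name `speakers_present` are likewise hypothetical.

```python
SAMPLE_SECONDS = 0.25          # length of one audio sample
MIN_IDENTIFIED_SECONDS = 25.0  # minimum identified speech per speaker
MIN_PROPORTION = 0.2           # assumed: identified / detected speech


def speakers_present(sample_labels):
    """Decide which speakers are present in one 15-minute section.

    sample_labels: one entry per 0.25 s sample; None means non-speech,
    otherwise a set with the speakers whose classifier fired (an empty
    set means speech was detected but nobody was identified).
    """
    detected = 0.0
    identified = {}
    for hits in sample_labels:
        if hits is None:
            continue
        detected += SAMPLE_SECONDS
        if len(hits) == 1:  # more than one positive: credited to nobody
            (name,) = hits
            identified[name] = identified.get(name, 0.0) + SAMPLE_SECONDS
    return {
        name
        for name, secs in identified.items()
        if secs > MIN_IDENTIFIED_SECONDS and secs / detected > MIN_PROPORTION
    }


# Made-up section: 100 s attributed to the spouse, 50 s to co-worker 1,
# 250 s of unattributed speech, and the remainder non-speech.
section = ([{"spouse"}] * 400 + [{"co-worker 1"}] * 200
           + [set()] * 1000 + [None] * 2000)
print(speakers_present(section))
```

In this example co-worker 1 clears the 25 s threshold but accounts for only 12.5% of the detected speech, so only the spouse is reported as present.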
4.
MULTIMEDIA EXPLORER
In this section we present a novel media browser, Socio-Graph, in order to demonstrate the utility of social context metadata for the task of personal media exploration and sharing. Specifically, it is a multi-user spatio-temporal browser with the ability to render images, video, and movies (structured for flexible delivery and containing cinematic and content annotation, detailed in previous work [1]) in a unified environment, and filter media items on time, position, labelled significant place, shared places, presence of actor, and social tie strength. This metadata can be used to filter media in isolation or combination. Spatial and temporal filtering are provided by the field of view and timeline scope, respectively; social spheres are labelled on the map; display of media from actors who share a social sphere with the user can be on or off; actor presence at media capture can be specified to be
any sub-set; and media can be thresholded on the user’s social tie strength with its owner. For example, queries expressing the following intentions can be formulated with the simple interactions detailed below: Find...
• Media taken at home in the previous month
• Media owned by anyone from the party I missed
• Media from the last family outing to the park near the city

An important sub-goal of Socio-Graph’s design was to greatly simplify navigation and align it with a set of conceptions common to personal media browsing regardless of experience with computers, namely social context. This can be viewed as an instance of aligning the software’s ‘structure or paths’ with the user’s [32], thus promoting the ‘disappearance of the interface’ [29].

The browsing environment is primarily 3D, with a first-person point of view. A timeline is displayed in a pane on the right. Full traversal to any latitude or longitude on the globe is supported in order to enable visualization of shared repositories around the globe. Specific design decisions include unified navigation: zooming is the metaphor used in both time and space (left and right click zoom in and out respectively), which is able to simultaneously deal with volume of items while providing a measure of orientation [31]. Media item access is also simplified, lacking an array of complex widgets [23]. Double-clicking selects an item, and if it is time-based, clicking again plays it. Serendipity, the possibility for a user to stumble upon unlooked for items, is a desirable trait [32, p. 226], and is achieved via field of view and the hidden interplay between tie strength and shared place filters (e.g. adding shared places viewing relaxes the tie strength threshold at the current place in view if the initial result set is very small).

[Figure 4: bar chart. Vertical axis: social tie strength (0 to 0.25). Horizontal axis: Spouse, Co-worker 1, Co-worker 2. Series: media flavoured, function flavoured, duration flavoured.]
Figure 4: Social tie strength, T, for a month of data.

Table 4: General response to Socio-Graph†
Question                                                              Mean  Median
I like this media browser                                             3.4   3
This browser is easy to use                                           2.9   3
I am satisfied with the media organization                            3.1   3
A month from now, I would still be interested in using this browser   3.6   4
† 0 - Strongly disagree, 4 - Strongly agree
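The way Socio-Graph’s social context filters combine can be illustrated with a small Python sketch. The `MediaItem` and `Query` types, their field names, and the rule that a user’s own media always passes the tie strength threshold are assumptions made for this example and are not taken from the Socio-Graph implementation; spatial filtering by field of view and the serendipity relaxation are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class MediaItem:
    owner: str
    timestamp: float                    # seconds since some origin
    place: Optional[str] = None         # labelled significant place
    actors: Set[str] = field(default_factory=set)  # actors present


@dataclass
class Query:
    user: str
    time_range: Optional[Tuple[float, float]] = None
    place: Optional[str] = None
    actors: Set[str] = field(default_factory=set)  # all must be present
    min_tie: float = 0.0                # threshold on T(user, owner)


def matches(item: MediaItem, query: Query,
            tie: Dict[Tuple[str, str], float]) -> bool:
    """Return True if the item passes every active filter of the query."""
    if query.time_range is not None:
        start, end = query.time_range
        if not (start <= item.timestamp <= end):
            return False
    if query.place is not None and item.place != query.place:
        return False
    if not query.actors <= item.actors:
        return False
    if item.owner != query.user:
        if tie.get((query.user, item.owner), 0.0) < query.min_tie:
            return False
    return True


def run_query(items, query, tie):
    return [item for item in items if matches(item, query, tie)]
```

Under these assumptions, ‘media taken at home in the previous month’ becomes a query with `place="Home"` and a one-month `time_range`, and raising `min_tie` narrows the result to media owned by the user’s closest ties.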
Media items entering or leaving the query set due to filter changes fade in and out dynamically, respectively, providing an additional cue as to the effect of the new filter configuration.

A user can import photos, videos, and movies in the abovementioned format. Time of creation for each item is extracted from the EXIF header of JPEGs, and thumbnails for videos created with digital cameras. Movie creation time is obtained from the file creation time stamp of the first shot. For users with GPS and audio logs, media items are tagged with position and actor presence when available. If an (interpolated) position is not available for the exact creation time of a media item, a widening neighbourhood in time is searched. For this study, a maximum search range of 30 minutes was used. Each media item is tagged with any actors detected present in the 15 minute sample in which its creation time falls. No attempt has been made to improve this annotation using coarser resolutions than 15 minutes or confidence values of actor presence. Media items without position are indexed on the timeline only and rendered in the main pane when selected.

Media clustered in time have been recognized as signatures of events [14, 8]. We cluster timestamps agglomeratively with a hardwired cut-off at 1 hour (dynamic navigation of the entire cluster tree in the environment is to be implemented). The cut-off was set with an aim to preserve micro-events, e.g. the cluster of photos of cutting the cake within the party event. Euclidean distance of time represented as seconds since an origin was used, and experiments found that distance between cluster centroids performed best as judged by cophenetic distance.

Significant places and places shared by pairs of users are calculated for every user with the algorithm of Section 3. Social tie strengths are calculated for the user audio logs using Equation 1.

A comparative user study was undertaken to evaluate Socio-Graph. It included 7 users (3 of whom took part in the experiment of Section 3), 5 male, 2 female, diverse in life situation, age, and computer competency. Each was given a quick introduction to Socio-Graph, as well as PhotoMesa and Picasa 2, followed by time to play with each, and finally a handful of assigned tasks.

Table 5: Rating of the usefulness of Socio-Graph’s social context filters†
Question             Mean  Median
Social tie strength  3     3
Location             3.9   4
Shared place         3.3   3
Actor presence       3.9   4
Event                3.4   4
† 0 - Strongly disagree, 4 - Strongly agree
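The position tagging and temporal event clustering steps described earlier in this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the nearest fix within 30 minutes stands in for the widening search in time, clusters are merged by centroid distance until the closest pair is more than one hour apart, and the function names and track format are hypothetical.

```python
from bisect import bisect_left


def nearest_position(track, t, max_gap=30 * 60):
    """Return the fix closest in time to t, or None beyond max_gap seconds.

    track: list of (timestamp_seconds, lat, lon), sorted by timestamp.
    """
    if not track:
        return None
    times = [p[0] for p in track]
    k = bisect_left(times, t)
    candidates = track[max(k - 1, 0):k + 1]  # fixes just before and after t
    best = min(candidates, key=lambda p: abs(p[0] - t))
    return best if abs(best[0] - t) <= max_gap else None


def cluster_events(timestamps, cutoff=3600.0):
    """Agglomerative clustering of timestamps using centroid distance.

    In one dimension the two closest centroids always belong to clusters
    that are adjacent in time, so only neighbouring clusters are compared.
    Merging stops once the closest pair is more than `cutoff` apart.
    """
    clusters = [[t] for t in sorted(timestamps)]
    while len(clusters) > 1:
        centroids = [sum(c) / len(c) for c in clusters]
        gaps = [centroids[k + 1] - centroids[k]
                for k in range(len(clusters) - 1)]
        k = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[k] > cutoff:
            break
        clusters[k:k + 2] = [clusters[k] + clusters[k + 1]]
    return clusters


# Made-up creation times in seconds since an origin.
print(cluster_events([0, 600, 1200, 10000, 10300, 50000]))
```

With the one hour cut-off, the three photos taken within twenty minutes of each other form one event, the pair taken five minutes apart forms another, and the isolated item remains on its own.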
In order to isolate evaluation of the browser from the performance of actor presence detection, groundtruth was used in the study. Table 4 records general responses to Socio-Graph, borrowed from the user study of [2]. Table 5 contains responses to questions concerning the usefulness of the social context filters. Table 6 lists responses ranking the three browsers at assigned tasks. Strategies of interaction with Socio-Graph varied among
users. Some common traits included the initial choice of a compass orientation, followed by cycles of diving in and out to inspect media. Where media for a given filter was clustered too densely for the spatial view to be clear, they were pulled apart by zooming within the timeline. Tables 4 & 5 show the reaction to Socio-Graph, and the possibility of filtering with social context, to be very positive. Curiously, both females in the study were ambivalent about the browser’s ease of use. In one case, further comments indicated that the zoom metaphor, while uniform in space and time, was perceived as complex. This user also happened to be a retiree with little computer experience. Table 6 indicates Socio-Graph was clearly preferred for tasks 1 and 3; however, there was more confusion over task 2. This stemmed from the much larger variety of search strategies employed. While tasks 1 and 3 were both a short step away from formulation in terms of the social context supported by Socio-Graph, task 2 could be performed by hunting for a visual match (e.g. faces, groups of people), best supported by PhotoMesa and Picasa, specific dates (e.g. birthdays, Christmas), best supported by PhotoMesa and Socio-Graph, or specific people (e.g. babies), best supported by Socio-Graph.

Figure 5: a. Global zoom; b. City-wide zoom; c. Media clustered at a significant place signals an event; d. A photo at maximum zoom.
5.
CONCLUSION AND FUTURE WORK
We have presented novel algorithms for extracting important aspects of social context from daily position traces and persistent audio: significant places, characterized by the socially significant labels Home, Work, or Other, together with shared places across users, and a measure of social tie
strength between users. Further, we have presented the design and evaluation of a novel personal media browser to demonstrate the utility of social context metadata. A comparative user study indicates the usefulness of this approach. We expect fusion of GPS position with Bluetooth, WiFi, and GSM to allow finer resolution indoors or greater coverage, and have used Placelab (www.placelab.org) with this in mind. Reliable detection of indicators of interaction is required if the social tie strength measure is to be effective. Future work will focus on improving extraction of speaker presence from audio via audio scene filtering and higher order temporal models, together with fusion of other opportunistic sources.
Table 6: Comparative ability to achieve tasks‡
Question                                                                        PhotoMesa  Picasa  Socio-Graph
Find media containing someone you don’t know well or haven’t seen recently     2.1        2.5     1.3
Find a photo of a {house, smile, special occasion}                              1.8        2.2     2.0
Find media from an event that you weren’t at                                    2.2        2.5     1.3
‡ Values are mean ranking, 1 best, 3 worst

6.
REFERENCES
[1] B. Adams and S. Venkatesh. Situated event bootstrapping and capture guidance for automated home movie authoring. In ACM International Conference on Multimedia, Singapore, November 2005.
[2] P. Appan and H. Sundaram. Networked multimedia event exploration. In Proceedings of the 12th annual ACM international conference on Multimedia, 2004.
[3] D. Ashbrook and T. Starner. Learning significant locations and predicting user movement with gps. In Int. Symposium on Wearable Computing, Seattle, WA, October 2002.
[4] M. Balabanovic, L. Chu, and G. Wolff. Storytelling with digital photographs. In Proceedings of Conference on Human Factors in Computing Systems (CHI), pages 564–571. ACM Press, 2000.
[5] J. Begole, J. Tang, R. Smith, and N. Yankelovich. Work rhythms: analyzing visualizations of awareness histories of distributed groups. In CSCW ’02: Proceedings of the 2002 ACM conference on Computer supported cooperative work, pages 334–343, New York, NY, USA, 2002. ACM Press.
[6] J. Bitton, S. Agamanolis, and M. Karau. Raw: conveying minimally-mediated impressions of everyday life with an audio-photographic tool. In CHI ’04: Proc. of the SIGCHI conference on Human factors in computing systems, pages 495–502, New York, NY, USA, 2004. ACM Press.
[7] J. Boreczky, A. Girgensohn, G. Golovchinsky, and S. Uchihashi. An interactive comic book presentation for exploring video. In CHI ’00: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 185–192, New York, NY, USA, 2000. ACM Press.
[8] M. Cooper, J. Foote, A. Girgensohn, and L. Wilcox. Temporal event clustering for digital photo collections. ACM Trans. Multimedia Comput. Commun. Appl., pages 269–288, 3(1), Aug. 2005.
[9] M. Davis and R. Sarvas. Mobile media metadata for mobile imaging. In IEEE International Conference on Multimedia and Expo, Taipei, Taiwan, July 2004.
[10] D. Ellis and K. Lee. Minimal-impact audio-based personal archives. In CARPE ’04: Proc. of the 1st ACM workshop on Continuous archival and retrieval of personal experiences, pages 39–47, NY, USA, 2004. ACM Press.
[11] D. Elsweiler, I. Ruthven, and C. Jones. Dealing with fragmented recollection of context in information management. In Context-Based Information Retrieval (CIR-05) Workshop in Fifth International and Interdisciplinary Conference on Modeling and Using Context (CONTEXT-05), 2005.
[12] D. Fisher and P. Dourish. Populating the social workscape. Technical Report UCI-ISR-02-2, UCI Institute for Software Research, Irvine, CA, 2002.
[13] J. Gemmell, A. Aris, and R. Lueder. Telling stories with mylifebits. In IEEE International Conference on Multimedia and Expo, Amsterdam, Netherlands, July 2005.
[14] A. Graham, H. Garcia-Molina, A. Paepcke, and T. Winograd. Time as essence for photo browsing through personal digital libraries. In JCDL ’02: Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries, pages 326–335, New York, NY, USA, 2002. ACM Press.
[15] R. Hariharan and K. Toyama. Project lachesis: Parsing and modeling location histories. Lecture Notes in Computer Science, 3234:106–124, 2004.
[16] J. Ho and S. S. Intille. Using context-aware computing to reduce the perceived burden of interruptions from mobile devices. In CHI ’05: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 909–918, New York, NY, USA, 2005. ACM Press.
[17] X.-S. Hua and S. Li. Personal media sharing and authoring on the web. In ACM International Conference on Multimedia, Singapore, November 2005.
[18] X.-S. Hua, L. Lu, and H.-J. Zhang. AVE - Automated home video editing. In Proc. of the 11th ACM International Conference on Multimedia, pages 490–497, November 2003.
[19] X.-S. Hua, L. Lu, and H.-J. Zhang. Automatically converting photographic series into video. In Proc. of the 12th ACM International Conference on Multimedia, 2004.
[20] Q. Jones, S. Grandhi, S. Whittaker, K. Chivakula, and L. Terveen. Putting systems into place: a qualitative study of design requirements for location-aware community systems. In CSCW ’04: Proceedings of the 2004 ACM conference on Computer supported cooperative work, pages 202–211, New York, NY, USA, 2004. ACM Press.
[21] J. Kang, W. Welbourne, B. Stewart, and G. Borriello. Extracting places from traces of locations. In WMASH ’04: Proc. of the 2nd ACM international workshop on Wireless mobile applications and services on WLAN hotspots, pages 110–118, New York, NY, USA, 2004. ACM Press.
[22] N. Kern and B. Schiele. Context-aware notification for wearable computing. In Proceedings of the 7th International Symposium on Wearable Computing, pages 223–230, New York, USA, October 2003.
[23] H. Lee and A. Smeaton. Designing the user interface for the Físchlár digital video library. Journal of Digital Information, Special Issue on Interactivity in Digital Libraries, 2(4), May 2002.
[24] Y.-Y. Lin, T.-L. Liu, and H.-T. Chen. Semantic manifold learning for image retrieval. In MULTIMEDIA ’05: Proceedings of the 13th annual ACM international conference on Multimedia, pages 249–258, New York, NY, USA, 2005. ACM Press.
[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of the 2nd Int. Conference on Knowledge Discovery and Data Mining.
[26] T. Mei and X.-S. Hua. Intention-based home video browsing. In MULTIMEDIA ’05: Proceedings of the 13th annual ACM international conference on Multimedia, pages 221–222, New York, NY, USA, 2005. ACM Press.
[27] N. Marmasse and C. Schmandt. Location-aware information delivery with commotion. In Second Int. Symposium on Handheld and Ubiquitous Computing.
[28] C. Nippert-Eng. Home and Work. The University of Chicago Press, 1995.
[29] D. Norman. The design of everyday things. The MIT Press, London, England, 1999.
[30] R. Rajani and A. Vorbau. MemoryNet viewer: Connecting people with media. Technical Report HPL-2003-219, HP Labs, 2003.
[31] J. Raskin. The humane interface: new directions for designing interactive systems. Addison Wesley, Reading, Mass., 2000.
[32] R. Rice, M. McCreadie, and S.-J. Chang. Accessing and Browsing Information and Communication. MIT Press, 2001.
[33] K. Rodden and K. Wood. How do people manage their digital photographs? In CHI ’03: Proc. of the SIGCHI conference on Human factors in computing systems, pages 409–416, New York, NY, USA, 2003. ACM Press.
[34] R. Samadani, D. Mukherjee, U. Gargi, N. Chang, D. Tretter, and M. Harville. Pathmarker: Systems for capturing trips. In IEEE International Conference on Multimedia and Expo, Taipei, Taiwan, July 2004.
[35] C. Zhou, D. Frankowski, P. Ludford, S. Shekhar, and L. Terveen. Discovering personal gazetteers: an interactive clustering approach. In GIS ’04: Proceedings of the 12th annual ACM international workshop on Geographic information systems, pages 266–273, New York, NY, USA, 2004. ACM Press.
[36] C. Zhou, S. Shekhar, and L. Terveen. Discovering personal paths from sparse gps traces. In 1st International Workshop on Data Mining in conjunction with 8th Joint Conference on Information Sciences, July 2005.