Creating Personalized Memories from Social Events: Community-Based Support for Multi-Camera Recordings of School Concerts

Rodrigo Laiola Guimaraes¹, Pablo Cesar¹, Dick C.A. Bulterman¹, Vilmos Zsombori², Ian Kegel³

¹ CWI: Centrum Wiskunde & Informatica, The Netherlands
² Department of Computing, Goldsmiths, University of London, United Kingdom
³ BT Research & Technology, United Kingdom

[email protected], [email protected], [email protected], [email protected], [email protected]

ABSTRACT
The wide availability of relatively high-quality cameras makes it easy for many users to capture video fragments of social events such as concerts, sports events or community gatherings. The wide availability of simple sharing tools makes it nearly as easy to upload individual fragments to on-line video sites. These shared fragments have the value of immediacy, but they are unable to capitalize on complementary content created by others. This article reports on a 10-month evaluation process intended to investigate socially-aware authoring tools. The results provide a set of guidelines for developing novel multimedia sharing applications in which relationships among people on both sides of the camera are important. The resulting prototype software enables community-based users to navigate through a large common content space and to generate highly personalized videos of targeted interest within a social circle. The videos are constructed from a collection of independent recordings of the event, managing the complex interpersonal relationships and time-variant social context within a community of diverse (but related) users. As part of the evaluation process, users were asked to reflect on the prototype application, based on their experiences using it with their own content recorded at a high school concert. According to the results, a system like ours is a valid alternative for supporting social interaction when people are apart.

Categories and Subject Descriptors
H.1.2 [Models and Principles]: User/Machine Systems – Human factors. H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems – Audio, Video.

General Terms
Design, Experimentation, Human Factors, Management.

Keywords
Web-Mediated Communication, Privacy, Video Authoring, Video Sharing and Information Management.

1. INTRODUCTION
The place: the Exhibition Hall in Prague. The date: August 23, 2009.

Radiohead is about to start their concert. The band invites fans to capture personal videos, distributing 50 Flip cameras. The cameras are then collected, and the videos are post-processed along with Radiohead's audio masters. The resulting DVD captures the concert from the viewpoint of the fans, making it more immersive and proximal than typical concert productions (see: radiohead-prague.nataly.fr).

The Radiohead concert typifies a shift in the way music concerts – and other social events – are being captured, edited, and remembered. In the past, professionals created a full-featured video, often structured according to a generic and anonymous narrative. Today, advances in non-professional devices are making each attendee a potential cameraperson who can easily upload personalized material to the Web, mostly as collections of raw-cut or semi-edited fragments. From the multimedia research perspective, this shift makes us reflect on and reconsider traditional models for content analysis, authoring, and sharing.

The existence of many content fragments of common social events has resulted in innovative work on individual fragment retrieval. This includes tools for the automatic extraction of instrumental solos and applause sections [11], and crowd-sourcing techniques that use the direct or indirect behavior of participants to improve retrieval results [16]. Once a collection of candidate fragments has been identified, other researchers have explored community video remixing [14] and automatic algorithmic solutions for synchronizing and organizing user-generated content from popular music events [8][15]. All of these solutions have focused on various properties of the video/audio content. Unfortunately, none of them have taken into account the social relationships between performers, users, and recipients of the media. They thus missed the opportunity of creating videos that exploit the social relationships between people participating in the same event.

This article considers the case in which performers and the audience belong to the same social circle (e.g., parents, siblings and classmates at a typical school concert). Each participating member of the audience records content for personal use, but they also capture content of potential group interest. This content may be interesting to the group for several reasons: it may break the monotony of a single camera viewpoint, it may provide alternative (and better) content for events of interest during the concert (solos, introductions, bloopers), or it may provide additional views of events that were not captured by a person's own camera.

Table 1. Handling Multi-Camera Recordings of Concerts.

                                                 | "Professional" Scripted DVD | Anonymous UGC Mashup          | Socially-aware community
Purpose                                          | Complete coverage           | Complete coverage             | Memories, bonds
Intelligent processes                            | Human director (planning)   | Video search, video analysis  | Video analysis, video authoring
Relationship btw performers and people recording | –                           | Similar likings, idols        | Family & friends
Preparation, capturing                           | Professional                | Ad-Hoc                        | Ad-Hoc

It is important to understand that the decision to use substitute or additional content will be made in the particular context of each

user separately: the father of the trombone player is not necessarily interested in the content made by the mother of the bass player unless that content is directly relevant to the father's needs. Put another way, by integrating knowledge of the structure of the social relationships within the group, content classification can be improved, and content searching and selection by individual users can be made more effective.

In order to understand the role of the social network among group members in a multi-camera setting, consider the comparison presented in Table 1. This table compares the use of multi-camera content in three situations: by a (professional) video crew creating an archival production, by a collection of anonymous users contributing to a conventional user-generated content (UGC) mash-up, and within a defined social circle as input for differentiated personal videos. (Semi-)professional DVD-style productions often follow a well-defined narrative model implemented by a human director, and are created to capture the essence of the total event. Anonymous UGC mashups are created from ad-hoc content collections, often based on the primary and secondary content classification methods discussed above. In socially-aware communities, friends and family members capture, edit and share videos of small-scale social events with the main purpose of creating personal (and not group) memories. Interested readers can view a video that illustrates the general concept of personalized community videos¹, and an example of a video generated by our system².

One of the major differentiators of our work is that its primary purpose is not the creation of an appealing video summary of the event or the creation of a collective collaborative community work. Instead, our system is intended to be a resource that facilitates the reuse of collective content for individual needs. Rather than using personal fragments to strengthen a common group video, our work takes group fragments to increase the value of a personal video. Each of the videos created by our system is tailored to the needs of particular members of the group – the video created for the father of the trombone player will be different from the one for the mother of the bass player, even though they may share some common content.

This paper considers the following three research questions in the context of a video editing system for high school concerts:

1. Can a socially-aware group editing system be defined in terms of existing social science theories and human-centered processes, and if so, which?
2. Does the functionality provided by a socially-aware editing system provide an identifiable improvement over other approaches for creating personalized videos?
3. Can these improvements be validated in end-user trials?

¹ http://www.youtube.com/user/TA2Project#p/u/6/re-uEyHszgM
² http://www.youtube.com/user/TA2Project#p/u/4/Ho1p_zcipyA

Our work has concentrated on the modeling of parents of students participating in a high school concert. In this scenario, parents capture recordings of their children for later viewing and possible sharing with friends and relatives. Working with a test group at a local high school, we investigate how focused content can be extracted from a broad repository, how content can be enhanced and tailored to form the basis of a micro-personal multimedia message packet, and how it can be transferred and shared with family and friends (each with different degrees of connectedness to the performer and his/her parents). Results from a ten-month evaluation process provide useful insights into how a socially-aware authoring and sharing system should be designed and architected to help users nurture their close circle. The evaluation process with a core group of parents included preliminary focus groups, the recording of a high-school concert, a trial and evaluation of the prototype application, and reflection on the overall process and experience.

This paper is structured as follows. Section 2 discusses related work with the intention of contextualizing our contributions. Then, Section 3 identifies key requirements for a system like the one presented in this article. The requirements are gathered from interviews and focus groups, while social theories motivate our work. Next, Section 4 answers the first research question, providing a detailed description of our system and how the requirements are met. Lastly, Section 5 reports on results regarding utility and usefulness, thus directly responding to the second and third research questions.

2. RELATED WORK
A number of research efforts have addressed the use of media for content analysis, semantic annotation, and video mashup generation of concert videos. Naci and Hanjalic [11] report on a video interaction environment for browsing recordings of music concerts, in which the underlying automatic analyzer extracts the instrumental solos and applause sections in the concert videos, as well as the level of excitement along the performance. Fans of a band can be useful for improving content retrieval mechanisms [16]: a video search engine allows user-provided feedback to improve, extend, and share automatically detected concepts from video footage recorded during a rock 'n' roll festival. Kennedy and Naaman [8] describe a system for the synchronization and organization of user-contributed content from live music events, creating an improved representation of the event that builds on automatic content matching. While the previous examples provide useful mechanisms for handling multi-camera video recordings, more recent articles focus on particular applications. Martin and Holtzman [10] describe an application that can combine multiple streams for news coverage in a cross-device and context-aware manner. Shrestha et al. [15] report on an application for creating mash-up videos from YouTube recordings of concerts. They present a number of content management mechanisms (e.g., temporal alignment and content quality assessment) that are then used for creating a multi-camera video. Results in that article indicate that the final videos are perceived almost as if created by a professional video editor. While our system applies many of the technologies and mechanisms presented in these articles – content alignment tools, automatic semantic annotation – we extend the problem space to

small-scale social events, in which the social bonds between people play a major role: performers and the people capturing the event are socially related, and the recordings will act in the future as common memories. Our work complements past research by providing a system that allows people to remember and share video concert experiences with the aim of fostering strong ties. Our work builds on previous findings in event modeling [18] and video abstraction/summarization [17]. The main difference of our work lies in the fact that we do not aim at providing a complete description of the shared event, but a better understanding of how community media can serve individual needs. Closely related to our work are topics such as home video management and navigation [1], and photo album creation [12]. However, we go one step further by allowing users to navigate, annotate, and utilize multi-camera recordings: material contributed by other participants of the shared event.

3. MOTIVATION
Recognizing the importance of looking at common media assets as a cornerstone for the sharing of family experiences [19], the system introduced in this paper provides an infrastructure to help smaller groups of people (such as a family, a school class or a sporting club) view, create, and share personalized videos. Our system combines the benefits of personal focus – knowing whom you are talking with – within the context of temporally asynchronous and spatially separated social meeting spaces. The motivation of our work is rooted in people's inherent need to socialize and to nurture relationships. One important aspect of our work is that we decided to follow an interdisciplinary approach in which both technological and social issues were addressed. At the core of this approach was establishing a long-term relationship with a group of parents within the target school as a basis for requirements analysis, concept evaluation and system validation.

3.1 Social Science Fundamentals
The system presented in this article is based on two social science theories: the strength of interpersonal ties and social connectedness. Further argumentation about the social theories guiding this system can be found in [7]. Based on a number of interviews (see Section 3.2), we conclude that current social media approaches are not adequate for a family or a small social group that wants to store and share collections of media that are personal and important to them. Our assumption is supported by other studies [5], which concluded that social media applications like Facebook do not take into account the interpersonal tie strength of their users. Granovetter [6] defines interpersonal ties as:

"…a combination of the amount of time, the emotional intensity, the intimacy (mutual confiding), and the reciprocal services which characterize the tie."

If we were to design a video sharing system intended for family and friends, we would exploit the social bonds between people by taking into account their personal relationships (intimacy). The system would provide mechanisms for personalizing the resulting videos (adding personal intensity) with some effort (amount of time), and would allow the recipient to acknowledge and reply to the creator (reciprocity).

Social connectedness theory helps us understand how social bonds develop over time, and how existing relationships are maintained. Social connectedness happens when one person is aware of another person, and confirms this awareness [13]. Such awareness between people can be natural and intense, or lightweight. As reported elsewhere, even photos [19] and sounds [3] can be good vehicles for creating a feeling of connectedness.

Figure 1. A schematic view of the perceived strength of a social bond over time in relation to our scenario.

Figure 1 illustrates a schematic view of the perceived strength of a social bond over time, showing recurring shared events ("interaction rituals" in the Durkheim sense [4]), with a fading strength of the social bond in between. The peaks in the figure correspond to intense and natural shared moments, when people participate in a joint activity (e.g., a music concert), re-affirming their relationships and extending their common pool of shared memories. The smaller peaks correspond to social connectedness actions, such as sending a text message or sharing a personalized video of the shared event, that build on such shared memories. If we follow social connectedness theory, we should design a system that mediates the smaller peaks and thus helps foster relationships over time.

3.2 Family Interviews and Focus Groups
In order to better understand the problem space, we involved a representative group of users at the beginning of the evaluation process. The first evaluation, in 2008, consisted of interviews with sixteen families across four countries (UK, Sweden, the Netherlands, and Germany). The second evaluation, in-depth focus groups with three parents each, was run in the summer of 2009 in the UK and in December 2009 in the Netherlands. Further information about these two preliminary user-centered evaluations can be found elsewhere [20], where the demographics, methodology, and results are reported in detail. Since the requirements for our system are partially derived from those results, we summarize here the key findings about current media sharing practices and challenges.

As social connectedness theory suggests, many participants engaged in varied forms of media sharing, as they felt that reliving memories and sharing experiences helped bring them (and other households) together. Parents e-mailed pictures of the kids playing football to the grandparents, and shared holiday pictures via Picasa, on disk, or using Facebook, enabling friends and families to stay in touch with each other's lives. Nevertheless, the interviewees said that if they shared media, they would do so via communication methods they perceived as private, and then only with trusted contacts. There was a general reticence among the parents towards existing social networking sites. From the focus groups it became clear that current video sharing models do not fit the needs of family and friends. Much richer systems are needed and will become an essential part of family relationships.

A number of parents reported photography as a hobby and would routinely edit their shared images. Their children, on the other hand, even if interested in photography, seemed less keen to manually edit the pictures, and declared a strong preference for automatic edits or relied on their parents. The participants would then discuss the incidents relating to the pictures later on with friends and family, on the phone or at the next reunion. Home videos tended to be watched far less frequently, although the young pre-teen participants appreciated them and were described by their parents as having "worn the tape[s] down" from constant viewing when much younger. In general the participants' responses converged on:
• a willingness to engage in varied forms of recollection through recorded videos;
• a clear requirement for systems that could be trusted to ensure privacy;
• a positive reaction to the suggestion of automatic and intelligent mechanisms for managing home videos.
In each case, creating personalized presentations (tailored for family use) remained a core issue.

3.3 Requirements
The general user needs, the social theories, and the initial interviews and focus groups have led to a number of requirements for community support for multi-camera recordings of shared events:
• Connectedness: the system should provide tools and mechanisms for maintaining relationships over time. The goal is not so much to support the high-intensity moments – the event itself – as the small peaks of awareness (recollection of the event);
• Privacy and control: current video sharing models do not fit the needs of family and friends. Our system should address their concerns, and target social community-based use cases;
• Effortlessness: people are reluctant to invest time in processes they consider could be done automatically. Our system should include automatic processes for analyzing, annotating, and creating videos of the shared event;
• Effort, intimacy and reciprocity: while such automatic processes lower the burden on the user, they do not by themselves conform with the social theories above. Since we do not want to limit the joy of handcrafting videos for others, our system should offer optional manual interfaces for personalization purposes.
We used these requirements as the basis for the system design discussed in the next section.

4. DESIGN GOALS AND ARCHITECTURE
The working title of our editing system is MyVideos. MyVideos is developed as a collection of configurable processes, each of which allowed us to study one or more aspects of the development of socially-aware videos. This section focuses on the general architecture and the design decisions behind MyVideos. Note that the purpose of this paper is not to describe any one component in detail, but to illustrate the interplay of facilities required to support social media sharing. The design and architecture are direct results of the evaluation process reported in this article.

4.1 System High-Level Description
The high-level workflow of MyVideos is sketched in Figure 2. The input material includes the video clips that the parents agree to upload, together with a master track recorded by the school. All the video clips are stored in a shared video repository that also serves as a media clip browser, in which parents, students, and authorized family members can view (and selectively annotate) the videos.

Figure 2. High-level workflow of MyVideos.

Privacy and a protected scope for sharing were key components of the system. Each media item is automatically associated with the person who uploaded it, and there are mechanisms for participants to restrict sharing of certain clips. Participants use their credentials for navigating the repository – those parts accessible to them – and for creating and sharing different stories intended for different people.

At capture time, there are no specific filming requirements for users. They can record what they wish using their own camera equipment. The goal is to recreate a realistic situation, in which friends and families are recording at a school concert. This flexibility comes at a cost, however, since most existing solutions that work well on controlled input datasets are not that useful for our use case. As indicated in the related work [15], handling user-generated content is challenging: it is recorded using a variety of devices (e.g., mobile phones), the quality and lighting are not optimal, and the length of the clips is not standard. As described below, each clip is automatically analyzed and tagged with information on performer, instrument, song, and special events. In practice, this automatic annotation must be fine-tuned manually, either by parents/users or by a school curator.
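To make the ingest-and-annotate workflow concrete, the sketch below shows one way a clip record could look after upload and automatic analysis. This is an illustrative Python model only: the field names (uploader, performers, shared_with, and so on) are our own assumptions and not the actual MyVideos schema, which stores annotations as MPEG-7 descriptions in a database (see Section 4.3).

from dataclasses import dataclass, field
from typing import List

@dataclass
class ClipAnnotation:
    # Hypothetical record for one uploaded clip after automatic analysis and manual fine-tuning.
    clip_id: str
    uploader: str                                            # every item is associated with the person who uploaded it
    start_offset: float                                      # position on the common concert timeline (seconds)
    duration: float                                          # clip length (seconds)
    performers: List[str] = field(default_factory=list)      # detected or manually corrected performer names
    instruments: List[str] = field(default_factory=list)     # e.g., "trombone", "bass"
    songs: List[str] = field(default_factory=list)           # songs covered by the clip
    special_events: List[str] = field(default_factory=list)  # e.g., "solo", added by parents or a school curator
    shared_with: List[str] = field(default_factory=list)     # empty means visible only to the uploader

def visible_to(clip: ClipAnnotation, user: str) -> bool:
    # A clip is visible to its uploader and to the people it was explicitly shared with.
    return user == clip.uploader or user in clip.shared_with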

4.2 Integrating Social Theory Structuring
In order to support the social theories described in Section 3.1, MyVideos incorporates a number of automatic, semi-automatic and manual processes that assist in the creation of personal memories of the event. These processes balance convenience and personal effort when making targeted, personalized videos. Emotional intensity is supported by a recommendation algorithm that searches for people and moments that might bring back memories to the user. For mediating intimacy, MyVideos provides users with the means to enrich videos for others by including highly personalized content such as audio, video, and textual comments. With these approaches we intend to increase the feeling of connectedness, particularly among family members and friends who could not attend the social event.

4.2.1 Reflecting Personal Effort
Users of MyVideos can automatically assemble a story based on a number of parameters such as the people to be featured, the songs to be included, and the duration of the compilation. Such a selection triggers a narrative engine that, in less than three minutes, creates an initial video from the multi-camera recordings. The narrative engine selects the most appropriate fragments of the videos in the repository based on the user preferences and assembles them following basic narrative constructs. Given an example query:

SelectedPersons=[Julia, Jan Has]; SelectedSong=[Cry Me River]; SelectedDuration=[3minutes]

The algorithm extracts the chosen song from the master audio track, and uses its structure as the backbone for the narration. It then selects all the video content aligned with the selected audio fragment; the master video track provides a good foundation and a possible fallback for the event fragments that are not well covered by individual recordings. The audio object is the leading layer and, in turn, it is made of AudioClips. This structure generates a sequence of all the songs that relate to the query. As soon as the length of the song sequence extends beyond the SelectedDuration, the compilation is terminated. The video object has the role of selecting appropriate video content in sync with the audio. An example of the selection criteria is the following:
1. Select video clips that are in sync with the audio
2. Ensure time continuity of the video
3. If there are more potential clips that ensure continuity, select those with Person annotations matching the user choices stored in SelectedPersons
4. If the result consists of more clips, select those whose Instruments annotation matches the instruments that are active in the audio layer
5. If the result consists of more clips, select those whose Person annotation matches the persons currently playing a solo
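The sketch below restates these selection criteria as a simplified Python loop: for each segment of the chosen song on the master timeline, it prefers clips that are in sync, keep temporal continuity, feature the selected persons, and match the active instruments or current soloist, falling back to the master track otherwise. It is a minimal reconstruction for illustration only, not the actual narrative engine (which is described in [21]); the dictionary fields and helper structure are assumptions.

from typing import Dict, List

def assemble_story(song_segments: List[Dict], clips: List[Dict],
                   selected_persons: List[str], master_clip: Dict,
                   max_duration: float) -> List[Dict]:
    # Simplified selection loop: pick one clip fragment per segment of the chosen song.
    edit_list, total, previous = [], 0.0, None
    for seg in song_segments:                       # segments of the chosen song, in timeline order
        if total >= max_duration:                   # stop once SelectedDuration is exceeded
            break
        # 1. keep only clips that are in sync with (i.e., overlap) the current audio segment
        candidates = [c for c in clips if c['start'] < seg['end'] and c['end'] > seg['start']]
        # 2. prefer staying on the same clip as before (time continuity of the video)
        pool = [c for c in candidates if previous and c['id'] == previous['id']] or candidates
        # 3. prefer clips whose Person annotations match SelectedPersons
        pool = [c for c in pool if set(selected_persons) & set(c['persons'])] or pool
        # 4. prefer clips whose Instruments annotation matches the instruments active in the audio layer
        pool = [c for c in pool if set(seg['active_instruments']) & set(c['instruments'])] or pool
        # 5. prefer clips featuring the person currently playing a solo, if any
        if seg.get('soloist'):
            pool = [c for c in pool if seg['soloist'] in c['persons']] or pool
        chosen = pool[0] if pool else master_clip   # the master track acts as the fall-back
        edit_list.append({'clip': chosen['id'], 'begin': seg['start'], 'end': seg['end']})
        total += seg['end'] - seg['start']
        previous = chosen
    return edit_list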

Once the automatic authoring process is complete, a new video compilation is created in which the selected song and people are featured. As reported elsewhere [21], such narrative constructs have been developed and tested together with professional video editors. Our assumption, based on the social theories, was that automatic methods – while useful – are not sufficient for creating personal memories of an event. This assumption is validated by our results (see Section 5). Participants found the assembled videos appealing, but they wanted (manual) tools to further personalize the stories. Figure 3 shows a comparison between automatic and manual generation of mashups. Automatic techniques are better suited for group needs, such as complete coverage or a summary of the event, but are not capable of capturing subtle personal and affective bonds. We argue instead for hybrid solutions, in which manual processes allow users to add their personal view to automatically assembled videos.

Figure 3. Comparison between automatic and manual generation of video compilations. Automatic methods are not sufficient for creating personal and intimate memories.

MyVideos provides such interfaces for manually fine-tuning video compilations. Users can improve and personalize existing productions by including other video clips from the shared repository. For example, a parent can add more clips in which his daughter is featured when sharing with grandma, or he can instead add a particularly funny moment from the event when creating a version for his brother. Results of the prototype evaluation indicate that users liked this functionality, which automatic processes are not able to provide – it is quite complicated to design an algorithm that detects "funny" events based on personal histories!

4.2.2 Supporting Emotional Intensity
Another assumption guiding the design of MyVideos was that users are particularly interested in finding video clips in which people close to them are featured. Results of the prototype evaluation indicate that such emotional searches were the most popular. Our system incorporates a recommender algorithm that is optimized based on the profile of the logged-in user. This optimization takes into account the semantic annotations associated with the user's media and the subjects that most frequently appear in his recordings.

The core of the navigation interface for visualizing the multi-camera recordings is this optimized recommender algorithm, which returns relevant fragments of video clips based on given parameters. MyVideos implements four different filters for emotional searches: instruments, songs, people, and cameras. The first filter returns video clips in which a particular instrument appears. The second one returns those belonging to a specific song of the concert. The third filter queries for the close friends and family members of the logged-in user. And the last one returns video clips taken by a particular camera (e.g., the camera of the logged-in user or the camera of his wife). These filters can be combined into complex queries, thus simplifying the searching process. For example, a father can ask for his son and daughter playing "Cry Me a River", since he remembers this was a very emotive moment of the concert. Given an example query:

SelectedPersons=[Julia, Jan Has]; SelectedSong=[Cry Me River];

The result will be:

QueryPersons(Julia, Jan Has) ∩ QueryEvents(Cry Me a River)

The query algorithm works as follows:
1. Select the fragments of the video clips matching the query; in the case of complex queries, intersect the resulting sets
2. If the result consists of one fragment, return it
3. If the result consists of more than one fragment, order the resulting list based on:
• the requested person;
• the annotations associated with the video clips uploaded by the logged-in user (affection parameter);
• the subjects that appear most frequently in the video clips uploaded by the logged-in user (affection parameter).
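As an illustration of how these filters and the ordering step could be combined, consider the following simplified Python sketch. The data layout and scoring weights are assumptions made for readability; the actual recommender is part of the MyVideos server and is not specified at this level of detail in the paper.

from collections import Counter
from typing import Dict, List, Set

def query_fragments(fragments: List[Dict], filters: Dict[str, Set[str]], user: str) -> List[Dict]:
    # Illustrative emotional search: intersect the active filters, then order by the affection parameters.
    # Each fragment is assumed to carry 'persons', 'songs', 'instruments', 'cameras' lists and an 'uploader'.
    def matches(frag: Dict) -> bool:
        # A fragment must satisfy every non-empty filter (intersection of the result sets).
        return all(not wanted or wanted & set(frag.get(key, []))
                   for key, wanted in filters.items())

    results = [f for f in fragments if matches(f)]
    if len(results) <= 1:
        return results                                        # a single fragment is returned as-is

    # Affection parameters: subjects who appear often in the clips the logged-in user uploaded.
    own_clips = [f for f in fragments if f.get('uploader') == user]
    familiarity = Counter(p for f in own_clips for p in f.get('persons', []))

    def score(frag: Dict) -> int:
        requested = len(filters.get('persons', set()) & set(frag.get('persons', [])))
        annotated_by_user = 1 if frag.get('uploader') == user else 0
        affection = sum(familiarity[p] for p in frag.get('persons', []))
        return requested * 100 + annotated_by_user * 10 + affection   # requested person dominates the ordering

    return sorted(results, key=score, reverse=True)

For the father's example query above, filters would be {'persons': {'Julia', 'Jan Has'}, 'songs': {'Cry Me a River'}}, and the two result sets are intersected exactly as in QueryPersons ∩ QueryEvents.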

In addition to the query interface that allows users to find moments of the videos that they particularly remember, MyVideos offers optional manual interfaces for improving the semantic annotations. When users are searching for specific memories, it might happen that results are not accurate due to mistakes in the annotations. Our system allows users to correct such annotations while previewing individual clips. For example, they can change, add, or remove the name of the performer and the title of the song. Moreover, they can add extra annotations (e.g., solo). According to our results, such functionality was appreciated, even though some participants were reluctant to make changes they considered the system should have done correctly in the first place.

4.2.3 Supporting Intimacy and Enabling Reciprocity
Apart from allowing fine-tuning of assembled videos, our system enables users to perform enrichments. MyVideos provides mechanisms for including personal audio, video, and textual commentaries. For example, these could be recordings of oneself (the author of the production) or subtitles aligned with the video clips, commenting on the event for others. Users can also record an introductory audio or video, leading to more personalized stories. Figure 4 illustrates some elements of an authored video, where a parent has created his own version of a concert. Moreover, he has added some personal assets, such as an introductory image and a video recording of himself that acts as the message envelope. According to our results, this functionality (we call it "Capture Me") would be used by all but one of our participants.

Figure 4. Elements of an authored video composition: the parent has included an introductory image and a video for making it more personal and intimate.

Our application addresses reciprocity by enabling life-long editing and enrichment of compiled videos. As indicated before, videos created with our tool can be manually improved and enriched using other assets from the repository, adding personal video and audio recordings, and including subtitles or textual comments. This is possible because of the underlying mechanism used for creating compilations. A high-level description language, in our case SMIL, allows us to temporally describe the contents of the compilation. The actual video clips are included by reference, and never modified or re-encoded. In other words, we do not burn a new compilation, but just describe it in a SMIL file that is then shared with other people. As shown in a previous publication [2], there are benefits to such an approach. In the context of this article, the main benefit is that it enables people to comment on, enrich, and improve existing compilations. Thus, users receiving an assembled video can easily include further timed comments as a reciprocal action towards the original sender. A grandmother might add an "Isn't Julia cute" reply in a reciprocal message to her son.
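To illustrate this description-by-reference mechanism, the sketch below generates a minimal SMIL-like document in which every fragment of a compilation is a video element pointing at a clip on the Video Server, delimited by clipBegin/clipEnd attributes rather than re-encoded. The exact SMIL profile, namespaces and attributes used by MyVideos are not detailed in this paper, so treat this structure as an assumption for illustration only.

import xml.etree.ElementTree as ET

def edit_list_to_smil(edit_list, intro_src=None):
    # Describe a compilation by reference: clips are pointed to, never modified or re-encoded.
    # edit_list items look like {'src': 'http://videoserver/clip42.mp4', 'begin': 12.0, 'end': 27.5}.
    # intro_src is an optional personal introduction recording ("Capture Me"), played first.
    smil = ET.Element('smil')
    body = ET.SubElement(smil, 'body')
    seq = ET.SubElement(body, 'seq')                      # fragments play one after another
    if intro_src:
        ET.SubElement(seq, 'video', src=intro_src)
    for item in edit_list:
        ET.SubElement(seq, 'video', src=item['src'],
                      clipBegin='%.2fs' % item['begin'],  # media-relative start of the fragment
                      clipEnd='%.2fs' % item['end'])
    return ET.tostring(smil, encoding='unicode')

Because the result is just a small description file, a recipient can enrich the same compilation with an extra text or audio element and send it back, which is what enables the reciprocity discussed above.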

4.2.4 Design Goals Relative to Requirements
In addition to supporting emotional intensity, intimacy, and reciprocity, MyVideos meets the other requirements identified in Section 3.3.

Using a trusted media storage server (provided by the school) fulfills the requirement of privacy. Parents can upload the material from the concerts to a common media repository. The repository is a controlled environment, since it is provided and maintained by the school, instead of being an external resource controlled by a third-party company. Moreover, all the media material is tagged and associated with the parent who uploaded it, and there are mechanisms so parents can decide not to share certain clips in which their children appear. Users can use their credentials for navigating the repository – those parts accessible to them – and for creating different stories for different people.

The requirement of effortlessness is met by the provision of a number of automatic processes that analyze and annotate the videos, and that help users navigate media assets and create memories. First, our system uses two automatic tools for analyzing and annotating the ingested video clips: a temporal alignment tool and a semantic video annotation tool. The temporal alignment tool is used to align all of the individual video clips to a common time base. The core of the temporal alignment algorithm is based on perceptual time-frequency analysis with a precision of 10 ms. Figure 5 shows the results of the temporal alignment of a recorded real-life dataset (more information on the datasets is provided in Section 5.1). The accuracy of our tool is around 99%, improving on other state-of-the-art solutions [8][15]. Since the focus of this paper is not on content analysis, we will not further detail this part of the system; the interested reader can find the algorithm and its evaluation elsewhere [9]. The Semantic Video Annotation Suite provides basic analysis functions, similar to the ones reported in [15]. The tool is capable of automatically detecting potential shot boundaries, of fragmenting the clips into coherent units, and of annotating the resulting video sub-clips.

Figure 5. Temporal alignment of a real-life dataset from a concert, where a community of users recorded video clips.

Second, our system uses other automatic processes for visualizing media assets and for the effortless creation of videos. As introduced in the previous subsections, users can navigate the video clips using a recommender algorithm, and they can automatically generate a video compilation from the multi-camera recordings. These automatic tools provide a baseline personalized video. Our personalization extension tools allow further transformation into highly targeted presentations from a common media resource.
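As a rough illustration of what temporal alignment involves, the sketch below estimates a clip's offset against the master track by cross-correlating 10 ms energy envelopes. This is a didactic stand-in only: the actual MyVideos alignment relies on the perceptual time-frequency analysis with confidence estimation described in [9], which is considerably more robust to the noisy, low-quality audio of user-generated clips.

import numpy as np

def align_offset(master: np.ndarray, clip: np.ndarray, sr: int, hop: float = 0.01) -> float:
    # Estimate where `clip` starts within `master` (mono audio, same sample rate `sr`).
    # Didactic stand-in for the perceptual time-frequency alignment of [9]; assumes the
    # clip is shorter than the master recording.
    step = max(1, int(sr * hop))                        # 10 ms analysis hop
    def envelope(x: np.ndarray) -> np.ndarray:
        n = (len(x) // step) * step
        frames = x[:n].reshape(-1, step)
        env = np.sqrt((frames ** 2).mean(axis=1))       # RMS energy per 10 ms frame
        return (env - env.mean()) / (env.std() + 1e-9)  # normalize for correlation
    em, ec = envelope(master), envelope(clip)
    corr = np.correlate(em, ec, mode='valid')           # slide the clip envelope along the master
    return float(np.argmax(corr)) * hop                 # best-matching offset, in seconds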

4.3 System Implementation Architecture
In order to promote easy user sharing, the client aspects of MyVideos have been implemented as a full-fledged Web application. The system is hosted on a dedicated research testbed with a high-bandwidth symmetrical Internet connection and virtualized processor clusters dedicated to hosting Web applications and serving video. In our architecture, each school would rent space and functionality on the testbed in order to make systems like ours available to their community. In our current implementation, the testbed space was provided to members of the community. Users could easily upload the videos they captured at a social event, which were then automatically processed: content segmentation, temporal alignment, and semantic annotation. Through a Web interface, users have access to the multi-camera recordings – including the ones recorded by other people at the same event. They can also enhance and personalize automatically generated mashup videos that combine the recordings.

Figure 6. Interface for automatically assembling stories.

Figure 7. Interface for navigating the video clips (timeline).

The server portion of MyVideos contains four main components: a Mongrel Web Application Server for the Ruby on Rails (RoR) application, whose role is to present the user interface and manage user interactions; a MySQL database that stores all data (users, rights, relationships) and semantic annotations (MPEG-7 descriptions); the automatic narrative interface engine, a tool that automatically assembles video clips from the recordings and is invoked by the RoR application via an HTTP call; and a Video Server that stores the recorded video clips and delivers them via HTTP video streaming.

Only the Application Server and the Video Server are directly accessible via the Internet, while the remaining components run inside the cloud. The client side is implemented using JavaScript and AJAX. On the client side, the only requirements are an installed SMIL plug-in (http://ambulantplayer.org/) and the ability to decode MPEG-4 Part 10 (H.264) video streams. When the user accesses the Website, the Application Server queries the database and composes the resulting Web page, returning it to the user's browser (adding the corresponding URLs of the video clips on the Video Server if necessary). When the user wants to automatically assemble a video, the Application Server makes a POST request to the narrative interface engine's HTTP server.
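As an illustration of that assembly request, the following sketch mimics the call the Application Server makes to the narrative interface engine. The endpoint path, parameter names and response format are assumptions: the paper only states that the RoR application invokes the engine's HTTP server via a POST request (and the real server code is Ruby on Rails, not Python).

import requests

# Hypothetical endpoint and parameter names, for illustration only.
NARRATIVE_ENGINE_URL = "http://narrative-engine.internal:8080/assemble"

payload = {
    "SelectedPersons": ["Julia", "Jan Has"],  # subject matter chosen in the Web interface (Figure 6)
    "SelectedSong": "Cry Me a River",
    "SelectedDuration": "3minutes",
    "Style": "default",
}

response = requests.post(NARRATIVE_ENGINE_URL, json=payload, timeout=180)
response.raise_for_status()
compilation = response.text                   # assumed here to be the SMIL description of the new compilation
print(compilation[:200])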

5. EVALUATIONS
The evaluation reported in this article is part of an extended study to better understand the intersection of social multimedia and social interactions. The final intention is to provide technological solutions for improving social communication between families living apart. Potential users have been involved in the process since the beginning, starting with interviews and focus groups and leading up to an evaluation of the prototype. A selected set of parents from a high school in Amsterdam has been actively collaborating with this research. Starting in December 2009, the parents were invited to a focus group that took place in Amsterdam; in April 2010 they recorded (together with some researchers) a concert of their children; in July 2010 they used our application with the video material recorded at the concert. This section reports on the evaluations carried out during this journey.

The final goal of the experiments is to evaluate the utility and usefulness of MyVideos. In particular, we are interested in whether the functionality provided by a socially-aware editing system like ours provides an identifiable improvement over other approaches for creating personalized videos. Ultimately, we are not that interested in physiological issues related to specific interfaces. Instead, we want to better understand the social interactions around our system and how well social connectedness is supported. In order to validate our assumptions we ran a number of experiments: some testing the functionality and others involving users. The following subsections detail the experiments, report on the results, and analyze them.

5.1 Experiments
5.1.1 Video Recordings
The first set of experiments consisted of recording three different concerts: a school rehearsal in Woodbridge in the UK, a jazz concert by an Amsterdam local band called the Jazz Warriors, and a school concert at the St. Ignatius Gymnasium in Amsterdam. The intention behind these concert recordings was twofold: to better understand the problem space and to gather real-life datasets that could be used for testing the functionality.

At the first concert, in Woodbridge, a total of five cameras were used to capture the rehearsals. The master camera was placed in a fixed location, front and centre, set to capture the entire scene (a 'wide' shot) with no camera movement, and an external stereo microphone was placed in a static location near the performance. This experiment was very useful for testing the automatic processes provided by our system: the temporal alignment algorithm, the semantic video annotation suite, the automatic authoring tool, and the recommender system for visualizing multi-camera recordings.

Then, at the end of November 2009, a concert of the Jazz Warriors was captured as part of an asset collection process. The goal of the capture session was to gain experience with an end-user setup similar to that expected for the Amsterdam school trial. The concert took place on 27 November 2009 at the Kompaszaal, a public restaurant and performance location in Amsterdam. The Jazz Warriors is a traditional big band with approximately 20 members. In total eight (8) cameras were used to capture the concert, of which two were considered 'masters' and were placed at fixed locations at stage left and stage right. In total, more than 200 video clips and approximately 80 images were collected at the event. The longest video clip was 50 minutes, the shortest 5 seconds.

The first two concerts were primarily experimental, providing us with enough material for fine-tuning the automatic and manual processes. On 16 April 2010 the concert of the Big Band (St. Ignatius Gymnasium) was recorded. In this case parents took part in the recordings and provided the research team with all the material. In total around 210 media objects were collected for a concert lasting about 1 hour and 35 minutes. Twelve (12) cameras

were used, two of them serving as the master cameras. These recordings were used for evaluating the prototype.

5.1.2 Prototype Evaluation
In cooperation with the local high school, a core group of seven people were recruited from among the relatives and friends of the kids who performed in the third concert, for evaluating the prototype. The rationale used for recruitment was to gather as many roles as possible, for a better understanding of the social needs of our potential users. The participants were three high school students, a social scientist, a software engineer, an art designer and a visual artist, resulting in a variety of needs that may influence video capturing, editing and sharing behaviors. All the participants were Dutch. The average age of the participants was 37.1 years (SD = 20.6 years); 3 participants (42.8%) were female. Among our participants, 3 had children (ranging in age from 14 to 17 years). All participants were currently living in the Netherlands, but an uncle of a performer who lives in the US was recruited to serve as an external participant (the only one who was not present at the concert recordings).

The prototype evaluation was conducted over a two-month span in the summer of 2010 (Jul-Sep). The review group was kept small so that we could establish directed and long-term relationships. The qualitative nature of our interactions provided us with a deep understanding of the ways in which people currently share experiences to foster strong ties. The participants represent a realistic sample for the intended use case: all the parents have kids going to the same high school, all of them tend to record their kids, and some of them have experience with multimedia editing tools. Moreover, the parents were involved in previous focus groups dating from December 2009 and they recorded their kids playing in the Big Band concert (April 2010). The goals of the study make it impossible to do crowdsourced testing, since users of the system should be people who care about the actual content of the videos. We used multiple methods for data collection, including in-person interviews, questionnaires, and interaction with the system.

Figure 6 shows the interface our participants used for automatically assembling stories. Users only have to select the subject matter (people, songs, instruments) and two parameters (style and duration). Then, by pressing the "GO" button, a story is automatically assembled. Figure 7 shows how a user can temporally navigate – there are other navigation interfaces – the multi-camera video material. In the figure, the video clips in which a performer connected to the user appears are highlighted in orange.

The experiment consisted of 2 sessions. The initial session was used to collect background information about the participants' video recording habits, their intentions behind these habits and their social relations around media. We also used this session as an opportunity to understand how the participants conceptualized the concert. The second (in-depth) session was dedicated to capturing the video editing practices and media sharing routines of the participants based on their interactions with the system.

Figure 8. Results of the questionnaire in the field trial: (a) social media habits; (b) utility and usefulness of MyVideos. The utility and usefulness questions included: Did you like MyVideos? Does MyVideos help you recall memories of social events? Would you like to see what other parents and friends have recorded at the same event? Is MyVideos better than traditional systems to browse videos recorded by other people? Is MyVideos better than traditional systems to find people you care about? Would you add/correct more metadata by using MyVideos? Is the material you capture enough to create video productions? Would you create more video stories with MyVideos (compared to existing tools)? Would you create video stories faster using MyVideos (compared to existing tools)? Would you create video stories more easily using MyVideos (compared to existing tools)? Would you fine-tune automatically generated video stories by using manual tools? Would you personalize videos by adding yourself to the story ("capture me")? Which approach did you like the most (automatic or manual)? Would you share more videos with MyVideos? Is MyVideos safer than other video sharing systems? Would you pay for MyVideos?

5.1.3 London Student Study

Another smaller and more incidental evaluation was performed to validate and contrast the results obtained when evaluating the prototype. In this case, we asked British students about their social media habits. We then introduced the system to them and let them play with it for a full day. A total of 20 kids completed the questionnaires. The students were from London high schools and were between 14 and 18 years old.

5.2 Results

Figure 8 shows the results of some of the questionnaires filled in by the participants. In all cases, the participants had interacted with the system, visualizing media assets and creating stories. The top part (a) shows their responses to questions related to media capturing, editing, and sharing. The bottom part (b) shows their answers to the questions related to utility and usefulness, including comparisons to other existing solutions.

Regarding common practices, all participants reported that they record videos at social events (e.g., family gatherings and vacation trips). However, most of them said that they barely looked at the recorded material afterwards. For most of them, video editing is time consuming and way too complicated. Nevertheless, some are familiar with video editing tools. When prompted about if and how they share their videos, our participants repeatedly said that in general they do not post personal videos on the Web. While the youngest participants argued their personal videos were not interesting enough to share on social networking services, for our older respondents privacy was the main reason not to share personal videos on the Web. On the other hand, all participants appreciate watching videos on YouTube, and most of them have a Facebook account. However, only a few do collaborative tagging of videos and/or photos.

Figure 9. Results of the questionnaire to high school students in London (14-18 years old).

The responses from the London high school questionnaires are captured in Figure 9. In general, we can speak of a generation that has grown up watching and sharing digital videos. Regarding MyVideos, the students graded it similarly to the evaluators of the prototype, indicating that our results can be generalized further than initially expected.

5.3 Findings
Based on our observations, the responses to the questionnaires, and the analysis of the collected audio/video, we present here results regarding effort, emotional intensity, and intimacy. We end this subsection with a discussion on social connectedness. In general the participants liked MyVideos and considered its functionality useful, since it allowed them to use recordings made by others.

5.3.1 Effort
A surprising outcome was that participants were not really into editing videos before sharing them. Participants rarely edited the video clips they had captured. The automatic authoring process was generally appreciated, making users feel that they would create stories faster and more easily. Overall, the assembled videos were considered visually compelling. As an example, Figure 10 presents the comments of a user while interacting with the system. Nevertheless, the participants considered that creating memories requires using the optional manual processes included in the system. Optional manual processes for personalizing and fine-tuning were welcomed and requested.

Figure 10. A participant using the automatic authoring of stories functionality.

"I want more portraits of my daughter (in this automatic generated compilation)... is it possible to edit an existing movie (in the Editor)?" (Father of a performer)

5.3.2 Emotional Intensity
The participants appreciated the filter (recommendation) functions offered by MyVideos, because such features skip all the rest and focus on the topics of interest. Interestingly, for the users we spoke to, a person was considered featured in a video (and suitable for search) only if it was a prominent shot (e.g., a close-up or a solo). Most of the participants reported they would not be interested in a shot in which the subject of their interest barely appears. We believe that the relatively low score on "Is MyVideos better than traditional systems to find people you care about?" is because this version of the system did not incorporate a quality filter.

"If he is in the background but he is on the shadow it is OK but I would like to see a video in which he really shows up… My mother (performer's grandma) would not enjoy seeing this video of him because there is not much to see" (Uncle of a performer)

When we prompted our participants about how much time they were willing to spend adding/correcting annotations, they were quite resistant, arguing that this process demands too much effort. Nevertheless, the questionnaires show a clear improvement over existing solutions.

5.3.3 Intimacy
Regarding the optional manual processes, for most of our participants the 'capture me' function – the functionality for including personal assets (as shown in Figure 4) in a video compilation – was seen as a way to personalize videos for a target audience. As shown in the results, such functionality was generally appreciated. Some of the participants told us they would use this functionality when creating a birthday present video, for instance.

5.3.4 Social Connectedness
In general we can conclude that a system like the one presented in this paper improves social connectedness. Participants largely agreed that MyVideos would help them recall memories. It would allow them to create more videos and to share them with others. All these actions support the small peaks of awareness shown in Figure 1.

6. DISCUSSION AND CONCLUSIONS
This article reports on a 10-month evaluation process that has led to the development of a novel application. The application follows a set of guidelines for nurturing strong ties: reflecting personal effort, supporting emotional intensity, supporting intimacy, and enabling reciprocity. In particular, our application allows users to navigate and share video clips from a social event – a concert – they took part in. While past research in this area concentrated on providing incremental improvements, our goal is to emphasize the importance of the interpersonal relationships between the performers and the audience. We believe our findings are relevant for future research in this new direction: socially-aware authoring and sharing tools.

Results from the evaluation process indicate that users generally liked the system. Based on their comments, reactions, and answers to the questionnaires, we can conclude that participants appreciated the benefits of such a system and considered it a valuable vehicle for remembering events. Some of them were proud of their own productions, but in most cases they were actively looking for video clips of their close friends and relatives. They complained – and despaired – when the quality of the video was not good enough or when the metadata of the videos was wrong. And most importantly, they wanted to share video clips featuring members of their close circle. "Can I send it now?" was the reaction of one of the participants after seeing a video clip he especially liked.

In the period since our initial investigations in 2008, we have learned many lessons. The most important is the necessity of providing good contextual information when visualizing large data sets. In terms of content analysis, participants looked for the special moments of the concert. Better affective tools are needed that are capable of recognizing such moments (e.g., a solo or a jam session) from low-quality user-generated content. As we learned from users, the tools should also be capable of evaluating the quality of each shot and filtering out segments in which the performer is not well portrayed. As shown in this article, our application meets requirements of social communities that are not addressed by existing social media Web applications. It pays special attention to privacy and trust; it aims at effortless interaction, while providing support for emotional intensity and intimacy. Moreover, it provides useful mechanisms for navigating complex media information. We hope that at the end of the journey, users will have a tool that improves social communication with those they care about.

REFERENCES
[1] Abowd, G. D., Gauger, M., and Lachenmann, A. 2003. The Family Video Archive: an annotation and browsing environment for home movies. In Proceedings of the ACM SIGMM International Workshop on Multimedia Information Retrieval, pp. 1-8.
[2] Cesar, P., Bulterman, D.C.A., Jansen, J., Geerts, D., Knoche, H., and Seager, W. 2009. Fragment, Tag, Enrich, and Send: Enhancing the Social Sharing of Videos. ACM TOMCCAP, 5(3): article number 19.
[3] Dib, L., Petrelli, D., and Whittaker, S. 2010. Sonic souvenirs: exploring the paradoxes of recorded sound for family remembering. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, pp. 391-400.
[4] Durkheim, E. 1971. The elementary forms of the religious life. Allen and Unwin.
[5] Gilbert, E. and Karahalios, K. 2009. Predicting tie strength with social media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 211-220.
[6] Granovetter, M. S. 1973. The Strength of Weak Ties. American Journal of Sociology, 78(6): 1360-1380.
[7] Guimaraes, R.L., Cesar, P., Bulterman, D.C.A., Kegel, I., and Ljungstrand, P. 2011. Social Practices around Personal Videos using the Web. In Proceedings of the ACM International Conference on Web Science.
[8] Kennedy, L. S. and Naaman, M. 2009. Less talk, more rock: automated organization of community-contributed collections of concert videos. In Proceedings of the International Conference on World Wide Web, pp. 311-320.
[9] Korchagin, D., Garner, P.N., and Dines, J. 2010. Automatic Temporal Alignment of AV Data with Confidence Estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
[10] Martin, R. and Holtzman, H. 2010. Newstream: a multi-device, cross-medium, and socially aware approach to news content. In Proceedings of EuroITV, pp. 83-90.
[11] Naci, S. U. and Hanjalic, A. 2007. Intelligent browsing of concert videos. In Proceedings of the ACM International Conference on Multimedia, pp. 150-151.
[12] Obrador, P., Oliveira, R., and Oliver, N. 2010. Supporting personal photo storytelling for social albums. In Proceedings of the ACM International Conference on Multimedia, pp. 561-570.
[13] Rettie, R. 2003. Connectedness, awareness and social presence. In Proceedings of the Annual International Workshop on Presence.
[14] Shaw, R. and Schmitz, P. 2006. Community annotation and remix: a research platform and pilot deployment. In Proceedings of the ACM International Workshop on Human-Centered Multimedia, pp. 89-98.
[15] Shrestha, P., de With, P.H.N., Weda, H., Barbieri, M., and Aarts, E.H.L. 2010. Automatic mashup generation from multiple-camera concert recordings. In Proceedings of the ACM International Conference on Multimedia, pp. 541-550.
[16] Snoek, C., Freiburg, B., Oomen, J., and Ordelman, R. 2010. Crowdsourcing Rock N' Roll Multimedia Retrieval. In Proceedings of the ACM International Conference on Multimedia, pp. 1535-1538.
[17] Truong, B. T. and Venkatesh, S. 2007. Video abstraction: A systematic review and classification. ACM TOMCCAP, 3(1).
[18] Westermann, U. and Jain, R. 2007. Toward a Common Event Model for Multimedia Applications. IEEE Multimedia, 14(1): 19-29.
[19] Whittaker, S., Bergman, O., and Clough, P. 2010. Easy on that trigger dad: a study of long term family photo retrieval. Personal and Ubiquitous Computing, 14(1): 31-43.
[20] Williams, D., Ursu, M. F., Cesar, P., Bergstrom, K., Kegel, I., and Meenowa, J. 2009. An emergent role for TV in social communication. In Proceedings of EuroITV, pp. 19-28.
[21] Zsombori, V., Frantzis, M., Guimarães, R. L., Ursu, M. F., Cesar, P., Kegel, I., Craigie, R., and Bulterman, D. 2011. Automatic Generation of Video Narratives from Shared UGC. In Proceedings of the ACM Conference on Hypertext and Hypermedia, pp. 325-334.
