Multimed Tools Appl (2011) 53:391–429 DOI 10.1007/s11042-010-0502-6
Towards the semantic and context-aware management of mobile multimedia

Windson Viana · Alina Dia Miron · Bogdan Moisuc · Jérôme Gensel · Marlène Villanova-Oliver · Hervé Martin
Published online: 24 March 2010 © Springer Science+Business Media, LLC 2010
Abstract Users of mobile devices can nowadays easily create large quantities of mobile multimedia documents tracing significant events attended, places visited or, simply, moments of their everyday life. However, they face the challenge of organizing these documents in order to facilitate searching through them at a later time and sharing them with other users. We propose using context awareness and semantic technologies in order to improve and facilitate the organization, annotation, retrieval and sharing of personal mobile multimedia documents. Our approach combines metadata extracted and enriched automatically from the users' context with annotations provided manually by the users and with annotations inferred by applying user-defined rules to context features. These new contextual metadata are integrated into the processes of annotation, sharing and keyword-based retrieval.

Keywords Semantic web · Spatial reasoning · Context awareness · Multimedia annotation · Semantic image retrieval · Index expansion
Supported by CAPES - Brazil
W. Viana (*) · A. D. Miron · B. Moisuc · J. Gensel · M. Villanova-Oliver · H. Martin
Laboratory of Informatics of Grenoble (LIG), STEAMER Team, 681, rue de la Passerelle, 38402 Saint Martin d'Hères, France
W. Viana e-mail: [email protected]
A. D. Miron e-mail: [email protected]
B. Moisuc e-mail: [email protected]
J. Gensel e-mail: [email protected]
M. Villanova-Oliver e-mail: [email protected]
H. Martin e-mail: [email protected]
1 Introduction

Recently, due to the popularisation of the Internet and software infrastructures, it has become very easy for all users of the World Wide Web, even those with no technical background, to create and publish digital content. Indeed, a tremendous quantity of multimedia documents (photos, videos, audio clips) produced by individuals is published on the Web using portals such as Yahoo!, Flickr, TripperMap, DailyMotion and YouTube. This "Web of people" has led to the evolution of the original Web 1.0 towards Web 2.0 [16], where content is created and managed by a broad public, leading to many social interactions between users. In order to facilitate searching and browsing through these very large multimedia document repositories, most Web 2.0 systems provide users with tools allowing them to describe the content of these documents by means of series of tags or free text [1]. However, the manual annotation of various personal multimedia documents can be a very tedious and time-consuming task.

In order to assist users and to simplify the annotation task, some systems propose automatic annotation approaches, in which the application itself performs the annotation of the documents. In content-based approaches, the application extracts low-level features from the document (e.g., colour, pitch, textures) and tries to transform them into higher semantic level attributes. Content-based annotation and retrieval approaches are very specific to the type of media concerned (video, image, audio, text) and have the major disadvantage of not completely bridging the "semantic gap" between machine-extracted low-level features and the high-level semantic constructs manipulated by the end-user. In contrast, context-based approaches try to bridge this semantic gap by using elements related to the context in which the document was captured or created (like an event in which the user has participated or a location he has visited), which are much more akin to the way users formulate their queries [25]. These context elements are automatically added as annotations and used for information retrieval. Unfortunately, most context-based approaches add annotations as simple text keywords, without considering the semantic relations that hold between the keywords and the content of the document. As a consequence, for document retrieval these approaches adopt a purely syntactic matching between keywords, which may lead to erroneous results.

We consider that automatic context-based annotation can be very useful for users as far as personal multimedia document annotation, organization, retrieval and sharing are concerned. We argue that a context-based approach simplifies these four activities for personal multimedia documents regardless of their type (video, image, audio or text), since the context elements used for annotation are the same for all media. Furthermore, we consider that the annotation and, as a consequence, the organization, retrieval and sharing of personal multimedia documents should go beyond a simple list of tags and include the semantics of the relation between the annotation and the content of the document. Thus, the retrieval process, for example, can bypass the syntactic matching problems.

In this paper, we propose to use context awareness and semantic technologies in order to improve the management of mobile multimedia. In our previous work [40], we have proposed PhotoMap, a context-based approach for photo annotation and organisation purposes.
In this paper, we extend PhotoMap in order to develop a more complete context-aware multimedia management system. We now provide a vocabulary in order to characterize the user's context when he creates a multimedia document (e.g., photos, videos, audio). The new system generates enriched metadata (i.e., the automatic annotation of the multimedia document) such as location, spatial relationships, nearby objects, season,
light status, and nearby friends. We also include user-defined inference rules that enrich the meaning of the contextual metadata (e.g., users can specify that an address x is their home address). These new contextual metadata are integrated into the processes of annotation, sharing and keyword-based retrieval. Thus, we aim at supporting semantic search and facilitating the sharing of mobile multimedia. For instance, we extract from these generated metadata a set of tags for indexing the multimedia documents. In our retrieval approach, each index term is composed of a word and a semantic stamp that represents the relationship between the multimedia document and the annotation (e.g., video captured in "Paris").

The paper is structured as follows: Section 2 reviews the existing approaches for personal multimedia annotation, sharing, organization and retrieval. In Section 3, we give an overview of the context-based management approach we propose. Section 4 details how our semantic annotation approach is extended with a new notion of context, the user's inference rules, and some qualitative spatial relations. Section 5 presents our method for semantic indexing of personal multimedia documents. In Section 6, we present our index expansion approach for enriching the semantic context (mostly spatial) of the document. Section 7 gives a description of our approach for assisting users in multimedia document sharing. In Section 8, we describe our implementation and our experiments for multimedia document retrieval. Section 9 gives some conclusions and defines several directions for future improvements of our approach.
2 Multimedia management

2.1 Annotation and organisation

In order to organize their collections of personal multimedia documents, especially photos and videos, the majority of users first change the name of these documents in order to facilitate future search. Sometimes, they group their documents in file directories which may correspond to time periods (e.g., March 2009), events (e.g., a trip, a conference) or places (e.g., photos and videos taken in Grenoble) [22]. However, this type of elementary organization does not significantly improve retrieval when the number of multimedia documents increases. Desktop image and video tools, such as Picasa, iPhoto and ACDSee1, try to ease the organisation task with features that automatically organise the documents by date/time attributes. However, a purely temporal organisation is insufficient for most users [25].

Another solution to facilitate organization and retrieval processes is the use of annotations. Document annotation consists in associating metadata with a document. Annotation highlights the significant role of personal multimedia documents in restoring forgotten memories of visited places, events and people [34]. Moreover, the use of annotations allows the development of more efficient personal multimedia management tools. For instance, geotagging personal photos and videos allows visualizing them in map-based interfaces such as Flickr Map2. Research in the field of multimedia annotation can be divided into two main categories: context-based and content-based. Furthermore, multimedia annotation approaches can also be classified according to the way they are performed: manual, automatic and semi-automatic.
1 http://www.acdsee.com/
2 http://www.flickr.com/map/
2.1.1 Content-based annotation

Content-based annotation describes what is depicted in the multimedia document, such as the objects and people that appear in a photo (e.g., Anne and her cat) or the user's activity captured in a video (e.g., she plays soccer) [32]. Most tools, such as Picasa, Flickr and YouTube, allow the manual association of spatial information (location tags) and free-text keywords with documents. Other applications, such as [6, 21, 33], offer a more powerful metadata description by using semantic annotation representations. For instance, Vannotea [33], developed in Brisbane, allows distributed users to add metadata to MPEG-7 (video), JPEG2000 (image) and Direct 3D (mesh) files, with the mesh being used to define regions of images. In spite of the obvious advantages of multimedia metadata, manual annotation requires a time-consuming effort from the user.

Automatic content-based approaches have emerged in order to free users from the task of annotating images and video documents. An initial multimedia database is created with pre-annotated images and videos. When one adds a new document, these systems exploit low-level visual features (e.g., colours, textures, shapes) and use algorithms to identify similarities between the new document and the documents stored in the database. However, there is often a semantic gap between the annotations recommended by these systems and the annotations desired by the user [1, 36]. Some approaches avoid this problem by proposing semi-automatic annotation methods. For example, M-OntoMat-Annotizer [4] intends to support manual annotation of image and video data by automatic extraction of low-level features that describe objects in the content. In [17], the system detects regions (corresponding to objects detected within the image) and allows users to formally describe spatial relations between image zones (e.g., region A is onto:on-the-left-of region B).

2.1.2 Context-based annotation

In order to overcome the shortcomings of manual annotation and the semantic gap problem, some automatic context-based annotation approaches have been proposed. Dey and Abowd [8] define context as "any information that can be used to characterize the situation of an entity. An entity is a person, place, or object that is considered relevant for the interaction between a user and an application, including the user and the applications themselves". In the case of multimedia applications, the main idea is to characterize the user's context when she produces a multimedia document. Indeed, a large part of the users' manual annotations contain words related to the context of creation of a multimedia document, such as names of places and events [1, 26, 27]. Moreover, by exploiting camera phones with built-in sensors (e.g., GPS), information about the user's position can be automatically captured and enriched [2, 12]. Many systems have successfully used this approach, especially for photo organization, publication and visualization. Examples of such systems are PhotoCompas [25], Life Blog [1], ZoneTag [3], MediAssist [27], PhotoCopain [35], and PhotoMap [38].
Most of these systems make use of captured spatial and temporal contexts (e.g., GPS and creation date/time) to automatically organize photo and video collections, and to offer context-based interfaces that allow users to navigate through their document collections. Some of these systems use contextual metadata to suggest keyword annotations (i.e., tags) that could
describe the documents. They access Web resources such as gazetteers, weather forecast services and social network profiles to infer rich information about the creation context (e.g., address location, weather, light status, nearby objects, people presence). In addition, other systems, like PhotoCopain [35], Life Blog [1] and MediAssist [27], go a step further in combining contextual metadata and content analysis to infer annotations about the photo content (e.g., indoor/outdoor classification, face detection, event detection).

2.2 Multimedia retrieval

2.2.1 Content-based multimedia retrieval

In content-based multimedia approaches, document similarity is inferred from the comparison of low-level features automatically extracted from the documents. In content-based image retrieval, for example, image similarity is based on visual low-level features like colours, textures and shapes. Content-based retrieval of audio uses features like pitch, amplitude, and frequency of audio sequences as a basis for comparison. Content-based video approaches typically merge features used for audio and for image documents, adding the temporal dimension. The typical usage for these approaches relies on query-by-example searches. The user provides, for instance, a video document, and the system responds with similar videos (from a visual, audio and temporal point of view). In spite of the latest advances in content-based multimedia retrieval systems (relying, for instance, on domain ontologies for the query-matching process), it is still difficult to bridge the so-called "semantic gap" [20]. In fact, visual or acoustic similarity is not semantic similarity. There is a divergence between low-level features and high-level semantic meanings. Visually similar images or video sequences might not represent the same meaning as expected by users when they provide an image as a query. In addition, the assumption that users are always able to formulate an appropriate query is questionable [15]. Some content-based approaches try to extract higher-level semantic objects out of low-level features. This is the case, for instance, with text retrieval approaches that use speech recognition in order to extract higher-order meaning from audio documents [30]. Unfortunately, these approaches have a limited applicability and cannot be generalized to all categories of personal multimedia.

2.2.2 Keyword-based multimedia retrieval

In order to circumvent the limitations of content-based retrieval, keyword-based approaches are used to describe image and video content and to formulate queries according to the users' comprehension of this content. Images and videos are annotated by using a set of terms, and users specify their query needs as a logical combination of several keywords. Generally, an exact matching process is performed in order to return a ranked list containing the most similar documents. Similarity, in this case, is measured between document terms and query keywords. We distinguish between two kinds of keyword-based multimedia search engines.

General search engines, such as Google (with its more specialized engines Google Image3 and Google Video4) and PicSearch5, index documents by using their properties (e.g., name, URL), the textual data surrounding the document in the Web page, and textual information extracted from Web links that point to the document.
3 http://images.google.com
4 http://video.google.com
5 http://www.picsearch.com
A query with these tools contains a few combined keywords and, occasionally, some filtering options. In spite of the huge number of indexed documents, the search results presented to users frequently contain unwanted results [42], since it is still difficult to extract text terms that are really semantically related to the documents.

Collection-based search engines (or CBSE), such as Flickr6 or YouTube7, give much more consistent and relevant results than those presented by general Web search engines [42]. This difference occurs because, generally, CBSE index their documents by using some structured, automatically produced document metadata (e.g., the type of the device on which the document was produced) and keywords (added manually by users as tags), instead of using textual content potentially related to the document. Still, we identify two main drawbacks of today's CBSE systems: manual annotation and syntactic matching. In spite of users' interest in tagging their multimedia documents in order to facilitate their indexing and to share them with other users [3], manual annotation is very time-consuming, especially when users have to tag large collections. Moreover, a time lag often occurs between the document creation time and the moment of annotation. As a consequence, users may add mismatched tags, producing noisy annotations. Another disadvantage of CBSE is the way these systems match the users' queries with document tags. Generally, these engines fail when purely syntactic comparisons between the user's keywords and the indexed metadata give negative results [27]. For example, if a user searches for photos of the "capital of France", the photos which are annotated only with "Paris" will not be returned, since only a syntactic comparison is performed. Another problem occurs when the user's request is not sufficient to solve ambiguities, especially when a query is formulated with ambiguous words, such as "April" (is it a person or the month?) or "Paris" (is it the city or the Paris Saint-Germain football team? If it is the city, in which country: France, USA or Canada?).

2.3 Multimedia sharing

One of the users' motivations for creating personal multimedia, such as photos and videos, is to keep a trace of a situation or an event that they can share afterwards. The popularization of blogs and Web 2.0 multimedia systems (e.g., Flickr, Facebook) has stimulated this capture-to-share behaviour. Modern mobile devices also contribute to this phenomenon by letting users create multimedia and share it in a ubiquitous way. For instance, users can use the ShoZu mobile application for publishing text, photos, and videos in the most popular online social communities (e.g., Flickr, Blogger, and Twitter) directly from their mobile devices. Other commercial applications, such as Nokia Life Blog and the iPhone MobileMe Gallery, have similar functionalities. They offer many sharing methods: email, MMS, Web publishing and Bluetooth-based messages.

As mentioned before, mobile devices, with their built-in sensors and personal data sources, are able to capture the user's context. Some mobile and context-based approaches have emerged in order to improve the sharing of multimedia documents. ZoneTag [3] and MMM [32] exploit the information about the user's context (e.g., GSM location and time) when the user shares a document.
Each time a user uploads a photo into the system, his context and the manual annotation added to the photo are stored. When a user wants to publish a new photo, the system matches the user's current context with the contexts stored in the database.
6 http://www.flickr.com/
7 www.youtube.com/
Hence, the system can propose the tags that other users, or even the same user, have added in the same situation (e.g., same location, same time of day). Some research projects have focused on automatically generated context-based MMS and blog messages. For example, the Aware project uses the ContextMedia [28] mobile application for sharing photos and messages annotated automatically with the user's context. Aware generates an MMS of a photo with the user's location and the names of nearby people and objects detected by Bluetooth scanning. ContextWatcher goes a step further, letting users create life blogs automatically. ContextWatcher can be configured [19] to post with a predefined frequency (e.g., every two hours) a digest of the user's context (i.e., location, time spent in the location, activity, mood, weather). Occasionally, the system publishes the multimedia created by the user during the journey. An example of an automatic life blog is available at http://koolwaaij.blogspot.com/.

The user's context information can also improve the multimedia delivery process (i.e., who the recipient is and when the multimedia will be shared). For instance, CAMM (Context-Aware Mobile Messaging System) is a Java-based architecture for generating context-sensitive mobile messages (MMS, SMS). CAMM gives users an enhanced level of control over their messages, allowing them to specify context-aware requirements for delivering a message. For example, an MMS sender can decide that only other CAMM users located in the same place as himself can receive the message. The CAMM server will then calculate which users fulfil this context-aware requirement. Co-location is determined by the system by scanning for nearby Bluetooth devices.
3 Context-aware management of mobile multimedia

3.1 Overview of our proposition

Our strong belief is that the problems concerning the management of personal multimedia outlined in the previous sections can be tackled by combining mobile computing and semantic technologies, as we illustrate in Fig. 1.
Fig. 1 Context-awareness and semantic technologies for improving management of personal multimedia
Present-day mobile phones are more than simple communication devices. They have evolved into genuine mobile multimedia studios. Mobile users create short videos, audio recordings, and photos with a quality similar to popular "point and shoot" digital cameras. These devices also allow users to write text documents and directly publish them on personal blogs or on the Internet, without using any additional devices. This new trend in multimedia creation is obvious if we look at the increasing number of photos and videos taken by mobile devices, such as the Apple iPhone or the Nokia N95, that are published on the Flickr Web site (see the statistics at http://www.flickr.com/cameras/). Moreover, the growing number of sensors embedded in mobile devices (e.g., GPS, RFID readers, compass) and the installed sensor-based applications allow the acquisition of important quantities of low-level data concerning the user's context during the multimedia document creation [37, 43]. Low-level data such as geographic coordinates, date, and nearby devices can be enriched by using available data from the Web and semantic reasoning technologies. For instance, the user's address and nearby points of interest can be derived by using gazetteer services and GPS data. Or, as shown in [38], the presence of other people in proximity can be inferred by combining data gathered by Bluetooth with social network profiles described using ontologies.

An enriched description of the users' context when they created a multimedia document can overcome the limitations of personal multimedia management tools. People often remember their multimedia documents using clues related to the circumstances in which they were created, such as names of places, monuments, seasons, week days, events, activities, and acquaintances at their side. People use these indications for annotating, organizing and searching their multimedia documents [25]. When a user shares multimedia documents, part of these clues can also be employed for describing them; for instance, a user may send an MMS to a friend with a message saying where she is and what she is doing at the moment.

Figure 2 illustrates our approach. The main idea is to acquire the largest amount of available information describing the user's context when she creates a multimedia document. The very first objective here is to reduce the time spent on the rather tedious task of manual annotation, by automatically deriving useful annotations. The process of multimedia context characterization begins with the acquisition of sensor data by the mobile device. The second step is the enrichment of this information, which is achieved with three processes: i) context interpretation, ii) social and spatial inference, and iii) user's rules inference. During the interpretation process, our system accesses available Web data sources, such as gazetteers, weather forecast services and databases containing spatial descriptions of points of interest, in order to transform the low-level context data into high-level context data. The extended context information derived from this process is represented by using an annotation ontology. The process of social and spatial inference employs this ontology and semantic reasoning mechanisms for improving the user's context data set with inferred spatial, temporal, and social information. For instance, the spatial relationships between the user's location and a point of interest can be inferred through the spatial reasoning process.
The last step in enriching the context information is the application of the user's inference rules. In our system, users specify rules in order to enrich the context information. For example, users can describe the address "1, avenue du success, Paris, France" as their home address. The inference process then executes all the user's predefined rules to derive more high-level information (e.g., Andy is at home). All the derived context information is attached as metadata to the multimedia document.
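The sketch below summarizes this enrichment chain in Python. It is only an illustration of the three steps (interpretation, social and spatial inference, user rules); every class, function and data-source name is a hypothetical placeholder, not the actual PhotoMap interface.

```python
# Illustrative sketch of the enrichment chain (interpretation, social/spatial
# inference, user rules). Every name here is a placeholder, not the real API.
from dataclasses import dataclass, field

@dataclass
class Context:
    latitude: float
    longitude: float
    timestamp: str
    bluetooth_ids: list
    enriched: dict = field(default_factory=dict)

def interpret(ctx: Context) -> None:
    # Step i) interpretation: the real system queries gazetteers, weather
    # services and POI databases; here the lookups are hard-coded.
    ctx.enriched["address"] = "Avenue du General de Gaulle, Saint-Denis, France"
    ctx.enriched["weather"] = {"temperature_c": 18, "sky": "clear"}

def infer_social(ctx: Context, social_network: dict) -> None:
    # Step ii) social inference: map detected Bluetooth addresses to friends
    # (spatial inference, e.g. nearby POIs, would be added here as well).
    ctx.enriched["nearby_friends"] = [
        social_network[b] for b in ctx.bluetooth_ids if b in social_network]

def apply_user_rules(ctx: Context, rules: list) -> None:
    # Step iii) user-defined rules, e.g. "this address is my home".
    for condition, conclusion in rules:
        if condition(ctx):
            ctx.enriched.update(conclusion)

ctx = Context(48.924, 2.360, "2008-05-10T19:30:00", ["00:1A:2B:3C"])
interpret(ctx)
infer_social(ctx, {"00:1A:2B:3C": "Bob"})
apply_user_rules(ctx, [(lambda c: "Saint-Denis" in c.enriched["address"],
                        {"user_place_label": "away from home"})])
print(ctx.enriched)
```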
Fig. 2 Overview of our approach
Once the contextual metadata are enriched, the multimedia management tool uses the spatial and temporal information (i.e., geographic coordinates and date/time) in order to automatically organize the collection of multimedia documents. The system creates spatial and temporal indexes. These indexes allow users to quickly navigate their collections via map-based interfaces and/or document lists ordered by date/time. The multimedia management tool exploits the metadata for the generation of multimedia annotations. The high-level context information is presented to the user for annotation validation purposes. Some content-based classification mechanisms can be introduced at this stage in order to augment the multimedia document metadata (e.g., whether the photo was taken indoors or outdoors).

From the contextual metadata, the system generates semantic tags, used to index the multimedia documents. In contrast with traditional tag-based approaches, our indexing approach supports a semantic search, based on content and context annotations. We propose here an adaptation of the Vector Space Model (VSM) [5] that exploits these annotations in order to overcome syntactic comparison limitations by incorporating explicit semantics in the retrieval process. For each multimedia document, we extract from its metadata a set of tags for indexing it. In our approach, each index term is composed of a word and a semantic stamp that represents the relationship between the content of the multimedia document and the annotation (e.g., tag = "Paris", semantic stamp = "locatedin"). The main idea is to use the Vector Space Model for computing the query-document matching score without losing the semantics of each keyword. With the semantic representation of the metadata, we expand the index terms by using spatial similarity measures in order to find other terms potentially related to the multimedia data capture context.

The multimedia management system also exploits the enriched metadata for improving its sharing services. When a user wants to share a document, the high-level metadata can be used for generating
semi-automatic MMS and for notifying the user's friends that a new multimedia document has been published on the Web.

3.2 From location and time to context

As mentioned before, our team has previously developed a context-based photo annotation system called PhotoMap [40]. Unlike other approaches, PhotoMap is both a mobile and a Web annotation system that uses contextual metadata for organizing photo collections. The mobile application allows a user to define spatiotemporal photo albums that are automatically associated with an ordered list of track points (i.e., timestamps and GPS coordinates). In addition, when a user takes a picture with a mobile phone camera, the J2ME client captures the photo shot context (i.e., geographic position of the device, date and time, Bluetooth addresses, and the camera properties) and associates this information with the photo. The user can also manually add tags directly on the device and avoid the time lag problem of manual annotation.

Another particularity of PhotoMap is the way photo annotations are represented. Camera systems or mobile applications that take photos are generally completely independent from those we use for organizing and searching photos. A common vocabulary describing photo metadata is required to allow communication between these systems. In our system, we have built an OWL-DL ontology called "ContextPhoto" for photo annotations, in which we represent contextual image metadata using five contextual dimensions: spatial, temporal, spatiotemporal, social and computational. By using an ontology approach, we can improve both the annotation and retrieval processes [13]. The server enriches the photo annotations when the photos are uploaded to it. The server combines semantic reasoning and Web data sources in order to infer other contextual information such as: the user's nearby friends, weather conditions, address information and nearby objects of interest (e.g., monuments, famous places, etc. which are georeferenced in articles found on Wikipedia). The PhotoMap server organizes the user's photos using the automatically acquired spatial and temporal data and it provides users with interfaces for photo browsing (e.g., a Google map-based interface). In addition, PhotoMap improves the users' recall of their photos by showing inferred spatial, temporal, and social information. Figure 3 shows the annotation
Fig. 3 Enriched annotation shown on the PhotoMap Web page
information presented on the PhotoMap Web system when a user selects a photo on the map. For instance, the "Who" tab shows the nearby people detected when the photo was taken and their social relationship with the owner of the mobile phone (i.e., friend or friend of a friend). A demonstration of PhotoMap is available at http://pyro.imag.fr/PhotoMap/. In the next sections, we show how we have extended our photo annotation approach towards a multimedia annotation tool and how we have integrated context-aware retrieval and sharing services not present in the original version.

3.3 Towards context-aware retrieval and sharing

Most of the previously mentioned systems, including PhotoMap, do not consider contextual metadata during the retrieval and sharing processes. In the case of the retrieval process, only the MediAssist system goes as far as the retrieval phase [27]. It offers a text-based retrieval interface, and the inferred image annotation is transformed into a keyword index for each image [27]. The main drawback of this approach is that it does not consider the semantic similarity of query terms during the matching process; only syntactic matching is performed. Generally speaking, since users may be trying to retrieve multimedia documents that were created a long time before the search, they tend to remember only little information about the context (like the time of day, the city and the season), or more general information (like the region visited or the commonly used name of a place). In addition, when users remember some information about the context of multimedia creation, the latter can be partially incorrect or different from the annotation information automatically generated by the annotation system. In these types of queries, syntactic comparisons fail. For example, if a video was taken at the end of winter, it may happen that the user asks for this document under the impression that it was taken in spring. Another factor of imprecision in queries is due to the users' lack of knowledge about the creation context of the multimedia document. For instance, if a photo was taken in a suburb town (e.g., Saint-Denis) of a bigger one (e.g., Paris), then users would try to retrieve such a photo with the location "Paris". Hence, they will get a lot of unexpected images, since in context-based photo systems the exact location "Saint-Denis" is extracted and inferred by using the GPS coordinates and gazetteer Web services.

One of the solutions for overcoming the problems raised by syntactic matching is to extract candidate keywords from the multimedia annotation and to store, together with these keywords, the semantic relations that hold between them and the multimedia document. The task of extracting the semantics of multimedia annotation is facilitated in automatic context-based annotation systems since, unlike simple tag-based systems, the metadata in these systems are commonly represented in a formal way (e.g., EXIF, RDF). Such metadata provide an explicit segmentation of the relationship types between annotations and multimedia documents. For example, with our ontology ContextPhoto, we can formally describe that the annotation "Stade de France, Saint Denis, France" represents the place where a photo was taken. With semantic keyword-based indexes derived from this segmentation, we allow users to create keyword-based queries that explicitly specify which keywords are about the context and which are about the content.
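A minimal sketch of this idea is given below: index and query terms are "stamped" with the semantic relation linking them to the document, so that content and context keywords are no longer confused. The stamp names follow the examples used in this paper; the helper function and sample terms are purely illustrative.

```python
# Minimal sketch of "stamped" terms: each term pairs a word with the semantic
# relation (stamp) linking it to the document. Illustrative only.
def stamp_terms(fields: dict) -> set:
    """Turn labelled query fields (what/where/who/when) into stamped terms."""
    stamp_for_field = {"what": "what", "where": "locatedin",
                       "who": "who", "when": "when"}
    return {f"{word.lower()}.{stamp_for_field[field]}"
            for field, words in fields.items() for word in words}

# Document side: terms derived from the context-based annotation of a photo.
doc_terms = {"stade_de_france.locatedin", "saint-denis.locatedin",
             "stadium.what", "bob.who"}

# Query side: the user states that "stadium" is about the content and that
# "Paris" is a location.
query_terms = stamp_terms({"what": ["stadium"], "where": ["Paris"]})

# Only 'stadium.what' matches syntactically; 'paris.locatedin' needs the
# spatial index expansion of Section 6 to be matched against Saint-Denis.
print(query_terms & doc_terms)
```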
In the case of the sharing process, only a few systems consider the contextual metadata, and, when they do, they use only low-level contextual information (e.g., temperature, address). We argue that high-level contextual metadata, combined with semantic technologies, improve both the delivery of multimedia documents and the generation of the messages attached to them.
Let us describe a scenario of multimedia retrieval and sharing that illustrates the usefulness of the semantics of context-based annotations. A mobile user, Andy, goes to the Stade de France (a stadium in the suburbs of Paris) with his friend Bob. Andy was 295 m from the stadium when he took a photo with a context-based annotation system, such as PhotoMap, which automatically tagged his photo. The annotation system transforms the GPS coordinates, by accessing a gazetteer service, into an address description (e.g., Avenue du Général de Gaulle, Saint-Denis, France). The annotation system also detects that Bob's Bluetooth device is nearby. Andy manually adds the tag "stadium" as a content tag and he publishes the photo on his personal space of a collection-based photo site.

Later, Andy would like to share this photo of the Stade de France with Bob. He initiates his search giving "stadium" and "Paris" as query keywords. The system then retrieves photos of the Stade de France mixed up with photos of Paris stadiums, but also of stadiums where the Paris Saint-Germain football team has been playing. If the system uses the text-based retrieval approach proposed by MediAssist, the results will not be better, since the term "Paris" does not belong to the photo index terms. Now let us assume that the multimedia management system has added content and context photo metadata during the indexing process. This system can thus propose advanced query interfaces for exploring the entire metadata. For example, Andy can specify that the keyword "Paris" indicates a location tag, or that "stadium" represents a content tag. The search term "Paris" being a geographical concept, a spatial similarity measure between the place name and the location metadata of the photos can be calculated by combining neighbourhood connectivity and geographical distance as a closeness measure (e.g., Saint-Denis is close to Paris). Andy's photo will now be correctly ranked.

In some sharing situations, searching for the multimedia document is not even necessary. Having inferred the enriched social annotation, the system could propose that Andy send a copy of the photo, or the photo's URL, to all of his friends detected as nearby people on his photos. Thus, in our Stade de France example, the system could ask Andy, when he tries to publish his photo, whether he wants to send it to Bob, just after the shot.
4 Annotation extension

In order to represent context elements in different application domains, and particularly for contextual multimedia annotation, we have defined in [40] an OWL-DL ontology called Context Top. Figure 4 illustrates this top-level ontology, which is extended here with qualitative spatial relations (Fig. 6).
Fig. 4 Fragment of the Context Top ontology defined in [40]
The Context Top ontology is centred on the Context concept, which is linked to a set of Context_Element instances (i.e., using the hasContextElement object property). The main idea is that the context describes the situation of a user's action (e.g., an audio recording) during an observation interval Δt and is composed of a group of context elements. The Context_Element concept has five specializations which describe respectively the social, computational, spatial, temporal and spatio-temporal context elements observed by the context-aware system (for more details, see [40]). Context Top can be specialized for representing both the user's interaction with the system and the user's situation when she creates a multimedia document.

In our previous work, we proposed an ontology that extends Context Top, called ContextPhoto [40]. ContextPhoto provides a vocabulary for representing annotations in context-aware photo collections. As a first new contribution, we have extended ContextPhoto towards an ontology for representing the user's situation when she creates a personal multimedia document with a mobile device. Figure 5 shows this new ontology, called ContextMultimedia. As mentioned before, we address here the management of personal multimedia produced in mobile situations. Hence, in contrast with more complex multimedia annotation vocabularies, our annotation representation considers the multimedia document as an indivisible object, even if we can associate other types of content annotation (e.g., MPEG-7) with this document. We made this choice since we are more interested in the description of the creation context of a multimedia document. ContextMultimedia contains vocabularies for attaching both manual and context-based annotations to multimedia content. Now, for example, we can characterize a video with a manual tag "football" and with the user's context both when she starts recording it and when she finishes it.

4.1 Qualitative spatial relations

The other main extension of our annotation approach is the integration into the Context Top ontology of concepts for representing qualitative spatial information. In the remainder of this section, we describe some reasoning techniques for handling this new annotation information. The notion of place is one of the most useful pieces of information for evoking a personal multimedia document [25]. However, the notion of place has many meanings that change according to the person [7]. For instance, a tourist may remember the name of the city or the topological relation between a point of interest and her position during the multimedia creation (e.g., inside the Stade de France). For representing the various aspects of spatial annotations, we have studied existing works on ontology-based representation of quantitative and qualitative spatial information. In this direction, an assessment study comparing 45 geospatial and temporal ontologies, from the annotation, qualitative reasoning and information integration points of view, has recently been published [29].
Fig. 5 ContextMultimedia ontology
Following the recommendations of the authors, we have chosen the GeoRSS-Simple OWL encoding as a reference spatial feature ontology, which we integrate into our own ontology of qualitative spatial relations. GeoRSS-Simple is a very lightweight format for spatial data that developers and users can quickly and easily integrate into their existing applications and use for expressing semantic annotations with little effort. It supports basic geometries (point, line, box, polygon...) and covers the typical use cases when encoding locations. Given that the GPS data we deal with are represented as pairs of longitude and latitude coordinates, and that the most complex geometrical descriptions we handle are polygons describing GeoReferenced_Objects, we only use the fragment of the GeoRSS ontology illustrated in Fig. 6.

The link between this ontology and the Context Top ontology is realized by the Spatial_Element concept, defined as a specialization of the gml:_Feature concept. An instance of the Spatial_Element class can thus have a name, specified using the gml:featurename data property, and can have a spatial extent described using an Address or a _Geometry. We have identified two types of Spatial_Elements: 1) points of interest (POIs), which are static locations modelled using the Georeferenced_Object concept, and 2) locations of multimedia data capture, which represent the concrete positions of the mobile device at the moment the multimedia was created. The latter are represented using the Place concept, related to the MultimediaCreation concept by the relation hasLocation.

In order to represent in OWL qualitative spatial relations that can be handled by our context reasoner, we propose the use of the QualitativeSpatialRelations ontology. It contains three generic types of spatial relations: Direction, Topology and Distance, modelled by using object properties. They can be defined between all types of spatial descriptions: _Geometry, _Feature and/or Address. The Direction relation has eight specializations (NorthFrom, SouthFrom, EastFrom, WestFrom, NorthEastFrom, NorthWestFrom, SouthEastFrom, SouthWestFrom) which allow the expression of direction relations existing between _Point objects. In order to model direction relations between a _Point instance and a _Polygon instance, the relations NorthOf, SouthOf, EastOf, WestOf, NorthEastOf, NorthWestOf, SouthEastOf, SouthWestOf and CenterOf are used. Two specializations of the Topology relation have also been defined: Inside and Outside. They define the position of a _Point instance with respect to a _Polygon instance; border points are considered as being inside the reference polygon.
Fig. 6 The general OWL ontology defining location description and spatial relations for instances of Spatial_Element concept
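As an illustration of how such qualitative relations can be derived from the stored geometries, the sketch below uses the shapely library to compute a Topology and a Direction relation between a photo shot location (_Point) and a POI footprint (_Polygon). The coordinates are made-up sample values, and the computation is only a simplified stand-in for the context reasoner described here.

```python
# Simplified sketch (not the actual context reasoner) of deriving the Inside/
# Outside and direction relations of Fig. 6 from geometries, using shapely.
from shapely.geometry import Point, Polygon

def topology(point: Point, polygon: Polygon) -> str:
    # Border points count as Inside, as stated in the text.
    return "Inside" if polygon.intersects(point) else "Outside"

def direction(point: Point, polygon: Polygon) -> str:
    # Direction of the point with respect to the polygon's centroid
    # (NorthOf, SouthEastOf, ... in the QualitativeSpatialRelations ontology).
    c = polygon.centroid
    ns = "North" if point.y > c.y else "South" if point.y < c.y else ""
    ew = "East" if point.x > c.x else "West" if point.x < c.x else ""
    return (ns + ew + "Of") if (ns or ew) else "CenterOf"

# Made-up footprint of a POI (lon/lat pairs) and a photo shot location.
stadium = Polygon([(2.358, 48.923), (2.363, 48.923),
                   (2.363, 48.926), (2.358, 48.926)])
shot_location = Point(2.3605, 48.921)

print(topology(shot_location, stadium))    # Outside
print(direction(shot_location, stadium))   # SouthOf
```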
Usually, Distance relations are harder to represent since they come with attributes specifying, for example, the metric system employed when calculating the distance (Euclidian distance, shortest road, drive time, etc.), the scale, or the distance type when regions of space are described (between the average border points, gravity centers, administrative centers, etc.). Nevertheless, given the application domain (personal multimedia annotation), we have simplified their definition by exclusively considering distances between points described using a unique scale.

For each related concept c_ext whose similarity exceeds the threshold l, we generate a new term t_ext as the combination of the instance name of c_ext and the stamp locatedin. We can then define ST_ext(md_i) = {t_1, t_2, ..., t_k} as the set of the stamped terms t_ext derived from this process.

5.2 Weighting terms

After the construction of the sets ST(md_i), CloseST(md_i), VeryCloseST(md_i), InsideST(md_i), OutsideST(md_i), DirectionST(md_i), LocatedinST(md_i) and ST_ext(md_i), we begin the creation of our inverted indexes using the Vector Space Model. Let $\vec{V}(md_i) = (w_{t_1,md_i}, w_{t_2,md_i}, \ldots, w_{t_{|T|},md_i})$ be the |T|-dimensional weighted vector of the multimedia document md_i, where $w_{t_j,md_i}$ is defined as the relevance weight of the term t_j for the multimedia document md_i. This weight is computed as:
$$
w_{t_j,md_i} =
\begin{cases}
idf(t_j), & t_j \in ST(md_i) \setminus OutsideST(md_i) \\
idf(t_j) \cdot InvDist(t_j, md_i), & t_j \in OutsideST(md_i) \\
idf(t_j) \cdot Sim_{SPATIAL}(c, c_{ext}, md_i), & t_j \in ST_{ext}(md_i) \\
0, & \text{otherwise}
\end{cases}
\qquad (1)
$$
idf(t_j) represents the inverse document frequency of the term t_j. It is computed as the IDF for text-based document retrieval: $idf(t_j) = \log(|P| / f_{t_j})$, where |P| is the size of the corpus and $f_{t_j}$ counts how many times the term t_j was used to index a document. If a term t_j is generated from a point of interest (i.e., t_j ∈ CloseST(p_i) or t_j ∈ DirectionST(p_i)), for example a Wikipedia entry, idf(t_j) is multiplied by a penalizing factor InvDist(t_j, md_i) that is inversely proportional to the Euclidian distance between the mobile device coordinates and the point of interest (i.e., $InvDist(t_j, md_i) = 1 / (1 + dist(t_j, md_i))$). This weighting method follows the assumption that the closer an object is to the user's position during the
multimedia creation, the more relevant its weight will be for indexing the multimedia document. When the point of interest has a polygon description and the multimedia was created inside its borders (i.e., inside the stadium, inside the park, ...), the distance between t_j and md_i is considered to be 0; the InvDist function can then not be used as a penalizing factor. If the multimedia was created outside the borders of the point of interest, the InvDist function uses the point representation of t_j. For the expanded terms (i.e., t_j ∈ ST_ext(md_i)), the weight $w_{t_j,md_i}$ is the result of the multiplication of idf(t_j) by the similarity score of the instance c_ext that originates t_j.

5.3 Query and matching processes

In our query model, users specify the semantic relations between the requested multimedia documents (e.g., the personal videos or photos they try to retrieve) and the query keywords, by using labelled text fields (i.e., what, where, who, when). Advanced options are also available for allowing users to express more refined relations such as "people on the photo" or "videos recorded close to an object". A query is, then, a set of pairs Q(q_i) = {<k_1, sr_1>, ..., <k_n, sr_n>}, where n is the number of keywords k_j and sr_j is the semantic relation associated with keyword k_j. We transform Q(q_i) into a vector $\vec{V}(q_i) = (w_{t_1,q_i}, \ldots, w_{t_{|T|},q_i})$, where $w_{t_j,q_i}$ represents the discrimination power of a query term for discerning relevant from irrelevant documents. In order to construct $\vec{V}(q_i)$, the first step is to create the set of terms ST(q_i) = {t_1, t_2, ..., t_n} from the tuples of Q. Since we have indexed our photos with terms generated from the combination of a word and a semantic stamp, for each keyword k_j ∈ Q we use its semantic relation to derive the related semantic stamp. Once the term generation is finished, we compute the weights of $\vec{V}(q_i)$ as:

$$
w_{t_j,q_i} =
\begin{cases}
Desc(t_j), & t_j \in ST(q_i) \\
0, & \text{otherwise}
\end{cases}
\qquad (2)
$$

Desc(t_j) is a real number in the range ]0,1] expressing the discriminating factor of a term among the others in a query. The objective is to offer a way of establishing priorities among query terms. Studies of users' behaviour while searching for multimedia documents have revealed that the most important clues for remembering such documents are "who", "where", and "when", in that order [25]. Hence, the discriminating factor can be used to express this priority order. After the computation of the query weights, we calculate the score of each multimedia document in the corpus. The score is calculated by using the cosine similarity:

$$
Score(md_i, q_i) = \frac{\vec{V}(md_i) \cdot \vec{V}(q_i)}{\lVert \vec{V}(md_i) \rVert \, \lVert \vec{V}(q_i) \rVert}
\qquad (3)
$$
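To make the weighting and matching of Equations (1)-(3) concrete, the following Python sketch computes document weights (with the InvDist penalty and the Sim_SPATIAL score for expanded terms), builds a stamped query vector and ranks documents by cosine similarity. The data structures and numeric values are illustrative assumptions, not the system's actual implementation.

```python
# Illustrative sketch of Equations (1)-(3); the data structures and numbers
# below are made-up assumptions, not the system's actual implementation.
import math

def indexed_terms(doc):
    return (set(doc["terms"]) | set(doc.get("outside_terms", {}))
            | set(doc.get("expanded_terms", {})))

def idf(term, corpus):
    df = sum(1 for doc in corpus if term in indexed_terms(doc))
    return math.log(len(corpus) / df) if df else 0.0

def doc_weight(term, doc, corpus):
    base = idf(term, corpus)
    if term in doc.get("outside_terms", {}):            # Eq. (1), 2nd case:
        distance_m = doc["outside_terms"][term]         # InvDist penalty
        return base * (1.0 / (1.0 + distance_m))
    if term in doc.get("expanded_terms", {}):           # Eq. (1), 3rd case:
        return base * doc["expanded_terms"][term]       # Sim_SPATIAL score
    return base if term in doc["terms"] else 0.0        # Eq. (1), 1st case

def cosine(vec_d, vec_q):                               # Eq. (3)
    dot = sum(w * vec_q.get(t, 0.0) for t, w in vec_d.items())
    nd = math.sqrt(sum(w * w for w in vec_d.values()))
    nq = math.sqrt(sum(w * w for w in vec_q.values()))
    return dot / (nd * nq) if nd and nq else 0.0

corpus = [
    {"terms": {"stadium.what", "saint_denis.locatedin", "bob.who"},
     "outside_terms": {"stade_de_france.nearby": 295.0},   # 295 m away
     "expanded_terms": {"paris.locatedin": 0.619}},        # from Section 6
    {"terms": {"tower.what", "paris.locatedin"}},
    {"terms": {"beach.what", "nice.locatedin"}},
]

query = {"stadium.what": 1.0, "paris.locatedin": 0.8}      # Desc(t_j) values
for doc in corpus:
    vec_d = {t: doc_weight(t, doc, corpus) for t in indexed_terms(doc)}
    print(round(cosine(vec_d, query), 3))   # Andy's photo ranks first
```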
6 Index expansion process

6.1 Spatial expansion
As stated previously, each document is represented by a vector $\vec{V}(md_i) = (w_{t_1,md_i}, w_{t_2,md_i}, \ldots, w_{t_{|T|},md_i})$, where $w_{t_j,md_i}$ represents the relevance weight of the annotation term t_j in the document md_i. If we consider Andy's example, the annotation terms produced are: Stadium.what, Saint_Denis.locatedin, France.locatedin, Europe.locatedin, Stade_de_France.nearby, Sunset.when,
Spring.when, 2008.when, May.when, and Bob.who. The position of each of these terms in the document index vector holds the corresponding relevance weight, calculated according to Formula (1). In the classical approaches, the weights of all other terms are set to 0, which means that those terms are not relevant for indexing the document. In our proposal, we consider that some of the other terms may be relevant to the photo if they have some degree of similarity with the document context. We proceed this way in order to alleviate the problem of imprecision in the user's information needs. The spatial terms that we consider for term expansion are those with the stamp locatedin. In order to expand the spatial terms in the document context, we propose to calculate the relevance weight of the potentially related terms using two spatial criteria: semantics and distance. The global similarity is calculated as:

$$
Sim_{SPATIAL}(c, c_{ext}, p_i) = \theta \cdot SemSim(c, c_{ext}) + (1 - \theta) \cdot \frac{1}{DistSim(c_{ext}, p_i) + 1}
\qquad (4)
$$
where θ and 1−θ represent the relevance weights of the semantic and distance similarities, respectively. The semantic dimension of the similarity accounts for the semantic closeness of spatial terms (equivalence, inclusion, neighbourhood and containment), whereas the distance dimension is used to calculate the physical distance between two places [18]. In order to consider the semantic relations, we need to use a spatial ontology, which represents the geographical concepts of a region, their properties and their relations. The manual construction of such ontologies is a time- and energy-consuming process. However, recent works have been successful in the automatic generation of spatial ontologies [5]. We can suppose that geographical ontologies will be more and more available in the future. In our approach, the geographical ontology is based on the GeoNames model8. We define a generic model that we instantiate according to the available geospatial data. This model (see Fig. 9) represents the geographical concept "Place", its properties (name, equivalentName and Geometry), and the relations that may exist between places (e.g., neighbour, partOf and capitalOf).

We start our similarity computation by searching in the spatial ontology for the concept c that generates the term c.locatedin of a document. If we find c in our ontology, we compute the spatial similarity between c and the other concepts having the same type (e.g., region, city). The steps of the process are:
- We use the spatial ontology in order to retrieve the known equivalent or alternative names of a place. The alternative names generate terms with the same relevance weight as c.locatedin (i.e., Sim_SPATIAL(c, c_ext, p_i) = 1).
- We use the spatial ontology instances and a SPARQL9 query in order to retrieve the set of places potentially related to the place c (a sketch is given below). The close places are either the neighbours of c or belong to the same geographical unit as c. With the retrieved concepts, we create a set of potentially related concepts Rc.
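A possible form of this SPARQL retrieval step, written with rdflib, is sketched below. The ontology file, namespace URI and property names (neighbour, partOf) mirror the model of Fig. 9, but they are assumptions about the concrete vocabulary rather than the system's actual schema.

```python
# Sketch of the SPARQL step above (rdflib). Namespace, file name and property
# names are assumptions mirroring Fig. 9, not the system's actual schema.
from rdflib import Graph

g = Graph()
g.parse("european_places.ttl", format="turtle")   # hypothetical ontology dump

query = """
PREFIX geo: <http://example.org/spatial-ontology#>
SELECT DISTINCT ?related WHERE {
  { geo:Saint_Denis geo:neighbour ?related . }     # neighbours of c
  UNION
  { geo:Saint_Denis geo:partOf ?unit .             # places in the same
    ?related geo:partOf ?unit .                    # geographical unit as c
    FILTER (?related != geo:Saint_Denis)
  }
}
"""

related_concepts = {row.related for row in g.query(query)}   # the set Rc
print(related_concepts)
```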
In the next steps, our goal is to reduce the initial set of related concepts and to keep only the most similar ones. To do so, we calculate the spatial similarity between the initial concept c and each c_ext ∈ Rc. The spatial similarity is calculated as follows. For each c_ext ∈ Rc, we calculate the semantic similarity SemSim(c, c_ext) by using an adaptation of the Wu & Palmer measure [20]. This similarity is asymmetric if one of the spatial concepts is the administrative centre of a geographical unit, i.e., SemSim(c, capital) ≠ SemSim(capital, c).
8 http://www.geonames.org/
9 http://www.w3.org/TR/rdf-sparql-query/
Fig. 9 Spatial ontology
$$
SemSim(c, c_{ext}) = \frac{2 \cdot depth(lsb)}{depth_{lsb}(c) + depth_{lsb}(c_{ext}) + \alpha + \beta}
\qquad (5)
$$
lsb is the lowest super bound of the two concepts c and c_ext in the ontology. The term depth(lsb) represents the distance that separates the lsb concept from the top ontology concept. The terms depth_lsb(c) and depth_lsb(c_ext) represent, respectively, the distances between the concepts c and c_ext and the root of the concept hierarchy. The factor α is introduced in order to increase the similarity between two neighbouring spatial terms: logically, the similarity between two neighbouring cities (α = 0) has to be higher than the similarity between two distant cities (α > 0). We introduce the factor β in order to give more importance to the capital of a geographical unit (β = 0), based on the assumption that the region capital is generally better known by users; otherwise, β > α. For each c_ext ∈ Rc, we also calculate the physical spatial similarity, based on DistSim(p_i, c_ext). In order to be more accurate, this measure should consider the travel time between the two places. However, this approach depends on the availability of such information. In our work, we calculate the Euclidian distance between the document coordinates and the boundary of the place c_ext. After the calculation, we keep only the concepts for which the similarity is higher than a given threshold l.

6.2 Spatial expansion examples

Table 2 illustrates some of the initial concepts which index Andy's photo and the expanded concepts we have kept after the process of spatial index expansion.
Table 2 Index expansion

Initial terms    Expanded concepts    Similarity
Saint-Denis      Paris                0.619
                 La Courneuve         0.545
In order to calculate the spatial similarity, we use a part of the European geographical ontology (Fig. 8). First, we generate the set of concepts related to Saint-Denis: Rc = {Paris, Vincennes, La Courneuve}. Then, we calculate the spatial similarity between Saint-Denis and each concept of the set Rc. We give an example of the calculation for the concept Paris. The parameter values used are: α = 1, β = 2 and θ = 0.65 (see Formulas 4 and 5).

SemSim(Saint-Denis, Paris) = (2 × 4) / (5 + 5 + 0 + 0) = 0.8

Distance(P, Paris) = 2.5 km ⇒ SimDist(P, Paris) = 1 / (2.5 + 1) = 0.285

Sim_SPATIAL(Saint-Denis, Paris, P) = 0.65 × 0.8 + 0.35 × 0.285 = 0.619
Sim_SPATIAL(Saint-Denis, Vincennes, P) = 0.428
Sim_SPATIAL(Saint-Denis, La Courneuve, P) = 0.545

The following step is to select only the place concepts for which the similarity is higher than the threshold l, which we set to 0.5. Hence, the city of Vincennes will not be included in the index of the photo P, whereas Paris and La Courneuve are kept as potentially spatially related concepts.
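The sketch below restates Formulas (4) and (5) as small Python functions and reproduces the Saint-Denis/Paris computation above (depths, α, β and θ are those of the worked example). The function signatures themselves are illustrative, not the system's API.

```python
# Sketch of Formulas (4) and (5); function names and call style are
# illustrative, not the system's API.
def sem_sim(depth_lsb, depth_c, depth_cext, alpha, beta):
    # Adapted Wu & Palmer measure, Formula (5).
    return (2 * depth_lsb) / (depth_c + depth_cext + alpha + beta)

def sim_spatial(sem, distance_km, theta=0.65):
    # Formula (4): weighted mix of semantic and distance-based similarity.
    return theta * sem + (1.0 - theta) * (1.0 / (distance_km + 1.0))

# Saint-Denis vs. Paris, as in the worked example above: both concepts sit
# one level below their lowest super bound, alpha = 0 (they are neighbours)
# and beta = 0 (Paris is the capital of the geographical unit).
sem = sem_sim(depth_lsb=4, depth_c=5, depth_cext=5, alpha=0, beta=0)   # 0.8
print(round(sim_spatial(sem, distance_km=2.5), 3))
# ~0.62 (0.619 in the text, which truncates 1/(2.5 + 1) to 0.285)
```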
7 Context-aware sharing services

In Section 2.3, we presented systems that exploit the user's context information in order to improve multimedia sharing. Most of them focus either on annotation recommendation, on automatic MMS generation, or on context-sensitive message delivery. Here, we propose to integrate these three functionalities by using our semantic-based approach. In contrast to these other proposals, we integrate high-level contextual data in the multimedia sharing process. In our system, users can share a multimedia document by MMS and Bluetooth. They can also publish it on the Web directly from their mobile devices. All these sharing services take advantage of the contextual information we produce based on the interpretation and the inference rules, as we show in the next subsections.

7.1 Context-based sharing services

Our application offers three context-based sharing services: 1) on-device context-aware annotation, 2) on-device context-aware notification, and 3) on-device context-based MMS generation. The mobile service of context-aware annotation is started when a user wants to publish a multimedia document on the Web. The goal of the service is to bring the result of the annotation process to the user's device. The mobile application sends to the server a description of the locally gathered data. The server executes the interpretation and inference processes and sends the enriched annotation back to the mobile device, letting users view and validate the suggested annotation. Figure 10 shows some screenshots of the mobile application with the suggested annotation inferred by the server. In order to show the location where the multimedia document was created, we have used the Yahoo! Map service.

The context-aware notification service provides users with interfaces to configure the system in order to notify other users that a new document was published. The notifications will be triggered only if some context-based conditions are validated. These conditions act in a similar manner to the context-based requirements proposed in the CAMM system
Fig. 10 Screenshots of the mobile application
(Section 2), except that, in our application, we can use high-level context information for expressing them. For instance, a user can specify that all photos taken at her home will be sent to her friends detected by the system as being near the photo shot location. As for the inference rules, the conditions of the context-based notifications are represented in SWRL. The rule body contains the context-based conditions to be satisfied; it can use any of the context elements and attributes of the ContextMultimedia ontology. The implication defines the users who should be notified. Table 3 shows the notification condition of the example given above. The first four lines restrict the list of people present during the multimedia creation, since the user wants to notify only her nearby friends. Lines five and six specify that the place name must be "home".
Table 3 SWRL example of notification condition

Person(?person) ^
Owner(?owner) ^ foaf:knows(?person, ?owner) ^
SocialContext(?scxt) ^
hasContextElement(?scxt, ?person) ^
Place(?place) ^ SpatialContext(?spcxt) ^
hasUserDescription(?place, "home")
→ hasToBeNotified(?person, true)
If both conditions hold, the property hasToBeNotified is set to true for all the people detected nearby. When a user publishes a new multimedia document, the system reads the annotation ontology of the document. The annotation, as mentioned before, contains the result of the interpretation and inference processes. With the enriched annotation, the system applies the notification rules. Finally, it reads the ontology to determine who should be notified that a new multimedia document has been published. The notification contains a URL pointing to the document.

A simpler sharing method, derived from the notification one, is also available from the mobile application. It consists in sending the created multimedia document to all the nearby devices that belong to a friend of the user (i.e., a member of her social network). This service can be accessed in one click from the sharing interface and uses Bluetooth for document transmission.

7.2 Context-based MMS

The third sharing method we propose is a context-based MMS service. This service contains a context-based parser for the semi-automatic generation of the messages to which the videos or photos the user wants to share are attached. The underlying idea is to simplify the user's task when writing the message while leaving her the possibility of changing it. The MMS service defines a vocabulary of reserved words that are replaced with the attribute values of the multimedia document creation context. The user can write a message by combining reserved words and normal words; the MMS parser then replaces the reserved words with information obtained from the high-level context data. Table 4 shows some of these reserved words and their related context elements. For example, the user can write "hru?, come here, i m .lc .no with .np btw it's .wt"10 and the system will generate "hru?, come here, i m at Montmartre, near Sacre Coeur (120 m) with Bob and Alice (Bob's friend) btw it's warm (32°C)". In this example, "32°C" is the low-level data obtained from the weather forecast service and "warm" is inferred based on the rules defined by the user (see Section 4). When the user is writing the message, she can access the available vocabulary and the current values of the context elements/attributes. When the system cannot infer a piece of information (e.g., there is no place name associated with the address or the coordinates), the MMS parser proposes to replace the reserved word with a lower-quality equivalent value. For example, if the name of a place cannot be inferred, the address will be proposed as an alternative value.
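As an illustration, the reserved-word substitution can be implemented as a simple token-by-token replacement. The following Java sketch is hypothetical (the class name and the pre-resolved context values are ours, not CoMMediA's) and assumes the high-level context values have already been computed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a reserved-word MMS parser: each token starting with '.'
// is looked up among the resolved context values and replaced when a value is known.
public final class MmsTemplateParser {

    public static String expand(String message, Map<String, String> contextValues) {
        StringBuilder out = new StringBuilder();
        for (String token : message.split(" ")) {
            String value = token.startsWith(".") ? contextValues.get(token) : null;
            out.append(value != null ? value : token).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Values that would normally come from the interpretation/inference processes.
        Map<String, String> ctx = new LinkedHashMap<>();
        ctx.put(".lc", "at Montmartre");
        ctx.put(".no", "near Sacre Coeur (120 m)");
        ctx.put(".np", "Bob and Alice (Bob's friend)");
        ctx.put(".wt", "warm (32\u00b0C)");

        System.out.println(expand("hru?, come here, i m .lc .no with .np btw it's .wt", ctx));
        // -> hru?, come here, i m at Montmartre, near Sacre Coeur (120 m)
        //    with Bob and Alice (Bob's friend) btw it's warm (32°C)
    }
}
```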
8 Implementation and experiments

In order to validate our approach, we have evolved PhotoMap [40] into a more complete multimedia management system, called CoMMediA (Context-aware Mobile and Multimedia Architecture). Our system follows a distributed application design and is composed of three subsystems: 1) a J2ME11-based mobile client, 2) a J2SE desktop tool, and 3) a J2EE Web system.
10 hru: How are you; btw: by the way
11 http://java.sun.com/javame/index.jsp
Table 4 Reserved words of our context-based MMS vocabulary

Reserved word   User's context information        Examples
.ad             User's current address            1, Avenue 1, Paris, France
.lc             Place name                        home, job, place du Tertre
.tm             Date/time                         03/04
.wt             Weather and high-level weather    warm (32°C, sunny)
.dr             Duration in the same place        3 h
.np             Nearby people                     Bob, Alice (Bob's friend)
.no             Nearest object                    Eiffel Tower (200 m)
.nos            Nearby objects                    Musée du Louvre (inside), Pyramide du Louvre (200 m)
Figure 11 shows an overview of the proposed multimedia management system and the interactions between the subsystems. In this section, we describe the development and integration of the new annotation, sharing, indexing and retrieval methods.

8.1 CoMMediA mobile application

The mobile application, developed with XMobile [40], runs on J2ME MIDP-enabled devices. The mobile client executes on top of a component-based middleware similar to OSGi12 that allows the application to dynamically adapt the way the context data are gathered. The previous PhotoMap mobile application worked in a non-connected mode: the client only captured local low-level context information (e.g., GPS coordinates, camera properties) and the photo annotation was enriched only when the photo was uploaded to the server. On the contrary, the CoMMediA mobile client can change its connection mode according to the user preferences and to the running services. For instance, when the user only wants to manually annotate photos or videos, the mobile client runs similarly to the PhotoMap mobile application. However, when the user starts the sharing services, the CoMMediA mobile client runs in a network-connected mode. Whenever the user wants to send a context-aware MMS or to publish a photo with an enriched annotation, the mobile application tries to establish the user's context by exchanging data with the server. To achieve this task, the CoMMediA server exposes the context inference process as a Web Service, and we have developed a communication layer for the mobile application that uses the J2ME Web Services API (JSR 172).

8.2 CoMMediA desktop application

As mentioned before, the mobile application allows users to create context-aware multimedia collections even in a non-connected mode. At a later time, users can synchronize their mobile devices with our server application by using a desktop Java application. The central purposes of the CoMMediA desktop tool are to allow users to select the multimedia documents they plan to publish on the Web and to decrease the processing time of the server inference process. All the methods for enriching the context information that do not depend on data from the server are executed locally, on the user's computer.
12 http://www.osgi.org/
Fig. 11 CoMMediA overview
For instance, the computation of the temporal attributes (e.g., season, day of the week) and the access to third-party Web Services (e.g., weather, light status, Wikipedia georeferenced articles) are implemented in the J2SE application. In order to enrich the annotation metadata with information concerning points of interest, we use the Geonames Service, which provides methods for querying Wikipedia georeferenced articles. The user starts the desktop application by indicating the access path of the ontology file that describes the context-aware collection (for more details, see the description of ContextPhoto [40]). For manipulating the ontology file generated by the mobile device, we use the Jena Semantic Web Framework13. After the acquisition of the low-level context data, the desktop application starts the local process of annotation enrichment. Once the new data are acquired (e.g., the weather conditions) or calculated (e.g., the period of the day), the J2SE application invokes a CoMMediA Web Service in order to complete the context interpretation and inference process. For example, the social reasoning is processed by the server, since it knows the user's social network. The final context information inferred by the server is presented in the desktop application. Figure 12 shows two screens of the desktop application. Using the first interface (the left part of Fig. 12), the user chooses the multimedia documents she wants to publish on the Web. On the second interface (the right part of Fig. 12), the inferred annotations are presented to the user, who can correct and validate them. After the user's validation, the selected multimedia documents and the enriched annotations are sent to the server.
13 http://jena.sourceforge.net/
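To illustrate how the desktop tool can manipulate a collection file with Jena, here is a minimal sketch under assumed names: the namespace, class and property identifiers below are illustrative placeholders, not the actual ContextPhoto/ContextMultimedia vocabulary.

```java
import com.hp.hpl.jena.ontology.Individual;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.util.iterator.ExtendedIterator;

// Hypothetical sketch: open the annotation ontology produced by the mobile client
// and read back a context attribute computed locally before calling the server.
public final class CollectionReader {

    private static final String NS = "http://example.org/contextmultimedia#"; // assumed namespace

    public static void main(String[] args) {
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM);
        model.read("file:collection.owl"); // ontology file generated on the device

        Property hasSeason = model.createProperty(NS, "hasSeason"); // assumed property name
        ExtendedIterator<Individual> it = model.listIndividuals();
        while (it.hasNext()) {
            Individual ind = it.next();
            if (ind.hasProperty(hasSeason)) {
                System.out.println(ind.getLocalName() + " season = "
                        + ind.getProperty(hasSeason).getString());
            }
        }
    }
}
```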
Fig. 12 Desktop application screenshots
8.3 CoMMediA server

The CoMMediA server is a Web system developed with J2EE, Struts, Jena, Jess, and Ajax. It offers interfaces for navigating through the multimedia collections, for viewing the multimedia annotations, and for retrieving a given multimedia document by using tags and spatial queries. Initially, the PhotoMap Web system supported only spatial and temporal navigation, with Google Maps and temporal lists, and offered a spatial query interface for searching photos with zoom and pan operations [39]. In order to validate our semantic indexing approach, we have extended the PhotoMap interfaces with a keyword-based query interface and a tag cloud. In addition, the system now supports other types of mobile multimedia documents. Figure 13 shows the main view of the system for a context-aware multimedia collection created in Paris, as well as the generated tag cloud and part of the itinerary followed by the user. Figure 14 presents the semantic keyword-based interface that we have integrated into the PhotoMap Web system. In addition to the new interfaces, CoMMediA exports a set of core functionalities as Web Services that handle the communication between the desktop/mobile applications and the server. For example, the context inference process can now be accessed from the mobile device when the user wants to share a multimedia document. Some other core functionalities of PhotoMap have been adapted in order to support our semantic indexing approach. For example, PhotoMap originally used the Geonames Address Service14 and the Address Finder15 in order to determine the address on Earth where a multimedia document was taken. However, these services do not provide sufficiently detailed information to support the spatial index expansion we propose. We have therefore used a geographic database (i.e., a 5 MB MapInfo file) containing the territorial units of France. This database contains the name, the geographic boundaries, and the administrative status (e.g., department capital) of 36,656 cities.
14 http://www.geonames.org/export/ws-overview.html
15 http://ashburnarcweb.esri.com/
Fig. 13 CoMMediA server main view
We have used the PostgreSQL DBMS and the PostGIS extension for creating our spatial ontology and for computing the semantic similarities (i.e., SemSim) between the city concepts. The generation of the ontology takes less than 1 h 30 min on a standard desktop computer (2.4 GHz, 1 GB of RAM). The time was longer than we had initially expected because we also computed the neighbouring relation between cities. We have exported the resulting data into an OWL-DL file. A new component has been developed for generating the stamped terms and the extended terms of each multimedia document. During the multimedia indexing process, we use Jena16 for handling the context-aware metadata and the spatial and temporal ontologies.

8.4 Experiments

We have conducted three experiments in order to validate our semantic retrieval approach. In these experiments, we decided to evaluate only the photo retrieval process, since we do not yet have many collections of video and audio in our database, and it is difficult to find georeferenced videos/audios on the Web. We have used the photos existing in PhotoMap (i.e., a set of 400 automatically geotagged photos) in order to calibrate the process of spatial index expansion. We have obtained α=1, β=2 and θ=0.65 as the best parameter values for the spatial similarity computation.
16 http://jena.sourceforge.net/
Fig. 14 Keyword-based interface
As the set of images available in PhotoMap was not large enough for our needs, we decided to enlarge our corpus with photos imported from Flickr, which gives a more accurate evaluation of our retrieval approach. We have thus developed a Java application that downloads georeferenced photos using the Flickr API17 (a minimal sketch of such a call is given after the footnote below) and generates an annotation file (i.e., an instance of ContextMultimedia). This ontology file includes the date, the geographic coordinates and the manual tags of each Flickr photo. The Java application provides, as a result, the same information we obtain from the CoMMediA mobile application in a non-connected mode. We have also used the interpretation and inference processes in order to enrich the metadata of each photo. For example, we have added the address, nearby Wikipedia objects14, and the light status as photo metadata. These enriched metadata are exploited for the generation of the index terms that support our keyword-based query process. We have built a corpus of 8007 photos with 10630 terms and 138 extended terms, including photos taken in the same regions or with the same manual tags as our photos.

The first experiment measures the importance of contextual metadata for photo retrieval. We selected ten photos from our database and tried to retrieve them in the new corpus by using different combinations of contextual terms. We computed the Mean Rank and the Mean Reciprocal Rank [41] in order to find out which subsets of photo metadata are the best for retrieving a photo. The results of this experiment are shown in Table 5. A desired photo can be ranked low if only temporal information or a city name is used in a query (i.e., the worst mean reciprocal ranks: 0.054 and 0.12). The problem with the city query is that most of the corpus is concentrated in a few cities (i.e., Paris, Grenoble, Saint-Denis, Vizille, Marseille, Saint-Malo).
17 http://www.flickr.com/services/api/
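The sketch below is a hedged illustration (not the authors' importer) of how georeferenced photos can be fetched from the public Flickr REST API; the API key is a placeholder, and parsing the XML response into a ContextMultimedia instance is left out.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Queries the Flickr REST API for georeferenced photos and prints the raw XML response.
public final class FlickrGeoImport {

    public static void main(String[] args) throws Exception {
        String apiKey = "YOUR_FLICKR_API_KEY"; // placeholder
        String query = "https://api.flickr.com/services/rest/"
                + "?method=flickr.photos.search"
                + "&api_key=" + apiKey
                + "&has_geo=1"                   // only georeferenced photos
                + "&extras=geo,date_taken,tags"  // coordinates, date and manual tags
                + "&per_page=50";

        URL url = new URL(query);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // XML <photo> elements carrying lat/lon, datetaken and tags
            }
        }
    }
}
```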
Table 5 Mean rank and mean reciprocal rank computation

Query term type                              Mean Rank    Mean Reciprocal Rank
City                                         142.85       0.12
City, Light Status, Season, Month, Year      11.85        0.21
Light Status, Season, Month, Year            34.14        0.054
Wikipedia                                    43.6         0.162
Wikipedia, Light Status, Season, Year        6.2          0.475
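For reference, the two measures reported in Table 5 are straightforward to compute once the rank of each target photo is known; the following is our own minimal illustration following [41], with hypothetical rank values.

```java
// Mean Rank averages the rank of each target photo; Mean Reciprocal Rank averages 1/rank.
public final class RankingMeasures {

    static double meanRank(int[] ranks) {
        double sum = 0;
        for (int r : ranks) sum += r;
        return sum / ranks.length;
    }

    static double meanReciprocalRank(int[] ranks) {
        double sum = 0;
        for (int r : ranks) sum += 1.0 / r;
        return sum / ranks.length;
    }

    public static void main(String[] args) {
        // Hypothetical ranks of the ten target photos for one query-term combination.
        int[] ranks = {1, 3, 2, 10, 5, 1, 4, 2, 7, 1};
        System.out.printf("Mean Rank = %.2f, MRR = %.3f%n", meanRank(ranks), meanReciprocalRank(ranks));
    }
}
```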
Hence, a city name alone is not sufficient for discriminating a photo. However, the corpus reflects the behaviour of a user who usually takes photos in the same regions, which is likely to be a frequent behaviour. When combining the name of the city with some temporal information, we obtain better ranking results. The use of Wikipedia object names together with temporal metadata gives the best ranking results, which shows the importance of nearby objects for retrieving a photo when the user remembers where and when she took it.

The second experiment compares the use of manual tags and of Wikipedia object names for retrieving photos of an object. We made queries with the keywords Louvre Museum, Eiffel Tower, Notre Dame de Paris and Stade de France. First, we wrote these queries as content tags; in this case, the system answers with photos manually annotated with these keywords. Afterwards, we reformulated the queries as location tags; the system then returns photos taken near the location of the Wikipedia objects that have these keywords as titles, and the rank of the photos depends directly on their distance to the objects. We measured the mean precision of the results at different top ranks (e.g., top 3 images, top 50 images), considering an answer as correct if the image contains the object. The results are shown in Fig. 15. The pink dotted line represents the mean precision when the keywords are matched against manual annotations, and the blue one shows the mean precision when the keywords are matched against nearby object names. Regarding the blue line, we can conclude that the fact that a photo is taken close to the coordinates of an object does not guarantee a good average precision. This is because, in some cases, the photo shows objects that are situated far from the location of the Wikipedia objects but are visible from their locations (e.g., one of the top-ranked photos for the query "Eiffel Tower" shows the view of the Arc de Triomphe from the Eiffel Tower). In order to alleviate this drawback, the spatial relations proposed in Section 4 will be considered in future experiments.
Fig. 15 Manual tags versus Wikipedia name entries (average precision of the manual-tag and nearby-object matching at the top 3, 10, 20, 30, 50 and 100 ranks)
Regarding the pink dotted line, we observe that the mean precision of these queries decreases as the number of ranked photos increases. This behaviour is due to several biases in the manual tags added by Flickr users. One conclusion of this experiment is that, in large collections of personal images, sightseeing photos can be found thanks to the integration of Wikipedia georeferenced articles.

In the third experiment, we compared a) photo indexing without semantic stamps, b) semantic photo indexing without spatial expansion, and c) semantic photo indexing with spatial expansion. We searched the corpus with the keywords "Paris and football". For the query without semantic stamps, we obtained 72 photos taken in different locations of France where the Paris Saint-Germain (PSG) football team has played at least once, mixed up with photos of the stadiums in Paris (e.g., Parc des Princes) and photos of the Stade de France. The Stade de France is located in the suburbs of Paris but, as we had supposed, Flickr users have added Paris as a manual annotation. For our proposal without spatial expansion, the system answered with fewer images (26), only those really taken within the Paris city limits (see Fig. 16). When we used the spatial expansion, we obtained 98 photos taken either in Paris or in Saint-Denis (i.e., at the Stade de France) (see Fig. 16). It is difficult to compare these three query approaches by measuring recall and precision (what is a correct photo for this type of query?); however, it is clear that the stamped-terms approach offers a more semantic and much more accurate query model, and that the spatial expansion process can be used to increase the recall when the system returns few results. Consequently, in our retrieval implementation, we first process the query without the expanded terms; if only a few results are found, we offer the user the possibility to expand the query (see Fig. 14), as sketched below.
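The two-stage strategy just described can be summarised by the following hedged Java sketch; SearchIndex, its methods and the result threshold are hypothetical placeholders, not CoMMediA's actual API.

```java
import java.util.List;

// Hypothetical search-index abstraction: the flag selects whether the spatially
// expanded terms are matched in addition to the stamped terms.
interface SearchIndex {
    List<String> search(String query, boolean useExpandedTerms);
}

final class TwoStageRetrieval {

    private static final int MIN_RESULTS = 10; // assumed threshold below which expansion is offered

    static List<String> retrieve(SearchIndex index, String query, boolean userAcceptsExpansion) {
        // First pass: match only the stamped (non-expanded) terms.
        List<String> results = index.search(query, false);
        if (results.size() >= MIN_RESULTS || !userAcceptsExpansion) {
            return results;
        }
        // Second pass: the user accepted to expand the query with spatially related terms.
        return index.search(query, true);
    }
}
```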
9 Conclusion and future work In this paper, we have presented an approach for semantic and context-aware management of mobile multimedia documents. We have used Semantic Web standards (OWL, SWRL) and context-aware tools for improving annotation, organisation, retrieval and sharing of mobile multimedia documents. In order to both increase and personalize the richness of the contextual metadata, we have extended our previous annotation approach with spatial reasoning and with user-defined inference rules. We have also exploited the improved
Fig. 16 Results of query retrieval without and with spatial expansion
contextual metadata for generating context-based MMS and for supporting context-aware notifications of shared multimedia documents. An adaptation of the simple, but powerful, vector space model has been proposed for keyword-based retrieval. Five dimensions have been used for classifying the contextual metadata and for building keyword-based multimedia document indexes: spatial, temporal, spatiotemporal, manual and social. Hence, users can formulate queries that combine keywords from the different dimensions while retaining some of the relation semantics attached to each keyword. Moreover, we overcome the shortcomings of purely syntactic comparisons by considering potentially related terms in order to spatially expand the multimedia document index. The experiments we have undertaken show the benefits of using contextual metadata for indexing personal multimedia documents and the usefulness of the spatial expansion process. In order to better evaluate the effectiveness of our retrieval approach, we will conduct further experiments on diverse types of multimedia documents, using additional semantic stamps extracted from the ContextMultimedia ontology. A user evaluation of our context-aware sharing services will also be carried out. Finally, we intend to integrate into our framework a model of access control based on the work presented in [9]. This model enforces access control by verifying quality of context (QoC) attributes of the gathered context data; we will adapt it to context-aware multimedia resources.
References

1. Aizawa K, Hori T, Kawasaki S, Ishikawa T (2004) Capture and efficient retrieval of life log. In: Pervasive 2004 Workshop on Memory and Sharing Experiences, Vienna, Austria, pp 15-20
2. Aleksy M, Butter T, Schader M (2008) Architecture for the development of context-sensitive mobile applications. Mobile Inf Syst 4(2):105–117
3. Ames M, Naaman M (2007) Why we tag: motivations for annotation in mobile and online media. In: CHI '07: Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, New York, NY, pp 971-980
4. Bloehdorn S, Petridis K, Saathoff C, Simou N, Tzouaras V, Avrithis Y, Handschuh S, Kompatsiaris Y, Staab S, Strintzis MG (2005) Semantic annotation of images and videos for multimedia analysis. In: Proceedings of the 2nd European Semantic Web Conference
5. Buscaldi D, Rosso P, Peris P (2006) Inferring geographical ontologies from multiple resources for geographical information retrieval. In: Jones C, Purves R (eds) Proceedings of the 3rd SIGIR Workshop on Geographical Information Retrieval
6. Carvalho RF, Chapman S, Ciravegna F (2009) Attributing semantics to personal photographs. Multimed Tools Appl 42(1):73–96
7. Christensen CB (2001) Place and experience: a philosophical topography. Mind 110(439):789–792, Oxford University Press
8. Dey AK, Abowd GD (2000) Towards a better understanding of context and context-awareness. In: HUC '99: Proceedings of the 1st international symposium on Handheld and Ubiquitous Computing. Springer-Verlag, London, UK, pp 304-307
9. Filho JB, Martin H (2008) QACBAC: an owner-centric QoC-aware context-based access control model for pervasive environments. In: Proceedings of the SIGSPATIAL ACM GIS 2008 International Workshop on Security and Privacy in GIS and LBS (SPRINGL '08), Irvine, California. ACM, New York, NY, pp 30-38
10. Frank AU (1996) Qualitative spatial reasoning: cardinal directions as an example. Int J Geogr Inf Sci 10(3):269–290
11. Goyal RK, Egenhofer M (2001) Cardinal directions between extended spatial objects. IEEE Trans Knowl Data Eng (in press)
12. Gulliver SR, Ghinea G, Patel M, Serif T (2007) A context-aware Tour Guide: user implications. Mobile Inf Syst 3(2):71–88
13. Guo B, Satake S, Imai M (2008) Home-explorer: ontology-based physical artifact search and hidden object detection system. Mobile Inf Syst 4(2):81–103
14. Hammiche S, Lopez B, Benbernou S, Hacid M-S (2008) Query rewriting for semantic multimedia data retrieval. Adv Comput Intel Ind Syst, pp 351-372
15. Heesch D (2008) A survey of browsing models for content based image retrieval. Multimed Tools Appl 40(2):261–284
16. Hendler J (2008) Web 3.0: chicken farms on the semantic web. Computer 41(1):106–108
17. Hollink L, Nguyen G, Schreiber G, Wielemaker J, Wielinga B, Worring M (2004) Adding spatial semantics to image annotations. In: Proceedings of the 4th International Workshop on Knowledge Markup and Semantic Annotation
18. Jones CB, Alani H, Tudhope D (2001) Geographical information retrieval with ontologies of place. In: Proceedings of the International Conference on Spatial Information Theory. Lecture Notes in Computer Science, vol 2205. Springer-Verlag, London, pp 322-335
19. Koolwaaij J, Tarlano A, Luther M, Nurmi P, Mrohs B, Battestini A, Vaidya R (2006) Context watcher: sharing context information in everyday life. In: Web Technologies, Applications, and Services (WTAS 2006). Acta Press
20. Liu Y, Zhang D, Lu G, Ma WY (2007) A survey of content-based image retrieval with high-level semantics. Pattern Recognit 40(1):262–282
21. Lux M, Klieber W, Granitzer M (2004) Caliph & Emir: semantics in multimedia retrieval and annotation. In: Proceedings of the 19th International CODATA Conference: The Information Society: New Horizons for Science, Berlin, Germany
22. Matellanes A, Evans A, Erdal B (2006) Creating an application for automatic annotations of images and video. In: Proceedings of the 1st International Workshop on Semantic Web Annotations for Multimedia (SWAMM), Edinburgh, Scotland
23. Miron AD, Gensel J, Villanova-Oliver M (2007) Towards the geo-spatial querying of the semantic web with ONTOAST. In: 7th International Symposium on Web and Wireless GIS (W2GIS 2007), Cardiff, UK
24. Moisuc B, Davoine PA, Gensel J, Martin H (2005) Design of spatio-temporal information systems for natural risk management with an object-based knowledge representation approach. Geomatica 59
25. Monaghan F, O'Sullivan D (2006) Automating photo annotation using services and ontologies. In: MDM'06: Proceedings of the 7th International Conference on Mobile Data Management. IEEE Computer Society, Washington, pp 79–83
26. Naaman M, Harada S, Wang Q, Garcia-Molina H, Paepcke A (2004) Context data in geo-referenced digital photo collections. In: MULTIMEDIA '04: Proceedings of the 12th annual ACM international conference on Multimedia, New York, NY, USA, pp 196-203
27. O'Hare N, Gurrin C, Jones GJF, Lee H, O'Connor NE, Smeaton AF (2007) Using text search for personal photo collections with the MediAssist system. In: Proceedings of ACM SAC 2007
28. Raento M, Oulasvirta A, Petit R, Toivonen H (2005) ContextPhone: a prototyping platform for context-aware mobile applications. IEEE Pervasive Computing 4:51–59
29. Ressler J, Dean M (2007) Geospatial ontology trade study. In: Ontology for the Intelligence Community (OIC-2007), Columbia, Maryland
30. Riley M, Heinen E, Ghosh J (2008) A text retrieval approach to content-based audio retrieval. In: Proceedings of the Ninth International Conference on Music Information Retrieval
31. Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620
32. Sarvas R, Herrarte E, Wilhelm A, Davis M (2004) Metadata creation system for mobile images. In: MobiSys '04: Proceedings of the 2nd international conference on Mobile systems, applications, and services. ACM, Boston, MA, USA, pp 36-48
33. Schroeter R, Hunter J, Kosovic D (2003) Vannotea: a collaborative video indexing, annotation and discussion system for broadband networks. In: Proceedings of the K-CAP 2003 Workshop on Knowledge Markup and Semantic Annotation, Florida, October 2003
34. Schweer A, Hinze A (2007) The Digital Parrot: combining context-awareness and semantics to augment memory. In: Workshop "Supporting Human Memory with Interactive Systems", HCI Conference, Lancaster, UK, September 2007
35. Tuffield M, Harris S, Dupplaw DP, Chakravarthy A, Brewster C, Gibbins N, O'Hara K, Ciravegna F, Sleeman D, Wilks Y, Shadbolt NR (2006) Image annotation with Photocopain. In: First International Workshop on Semantic Web Annotations for Multimedia (SWAMM 2006) at WWW2006, Edinburgh, United Kingdom
36. Uren V, Cimiano P, Iria J, Handschuh S, Vargas-Vera M, Motta E, Ciravegna F (2006) Semantic annotation for knowledge management: requirements and a survey of the state of the art. Web Semant Sci Serv Agents World Wide Web 4:14–28
37. Viana W, Castro RMC, Machado J, Filho B, Magalhaes K, Giovano C (2004) Mobis: a solution for the development of secure applications for mobile devices. In: ICT 2004: 11th International Conference on Telecommunications, Fortaleza-CE, Brazil. Lecture Notes in Computer Science, vol 3124, pp 1015-1022
38. Viana W, Andrade RMC (2008) XMobile: a MB-UID environment for semi-automatic generation of adaptive applications for mobile devices. J Syst Softw 81(3):382–394
39. Viana W, Bringel F, Gensel J, Villanova-Oliver M, Martin H (2007) PhotoMap: automatic spatiotemporal annotation for mobile photos. Lect Notes Comput Sci 4857:187–201
40. Viana W, Bringel F, Gensel J, Villanova-Oliver M, Martin H (2008) PhotoMap: from location and time to context-aware photo annotations. J Location Based Serv 2(3):211–235
41. Voorhees EM (1999) The TREC-8 question answering track report. In: Proceedings of the 8th Text Retrieval Conference, Gaithersburg, Maryland, USA, pp 77-82
42. Wang S, Jing F, He J, Du Q, Zhang L (2007) IGroup: presenting web image search results in semantic clusters. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '07). ACM
43. Yamaba T, Takagi A, Nakajima T (2005) Citron: a context information acquisition framework for personal devices. In: Proceedings of the 11th International Conference on Embedded and Real-Time Computing Systems and Applications, pp 489-495
Windson Viana obtained his Ph.D. degree in February 2010 from the University Joseph Fourier (Université de Grenoble) in Grenoble, France. He received his BS degree (2002) and his MS degree (2005) in Computer Science from the Federal University of Ceará, Brazil. He joined the LIG (Laboratoire d'Informatique de Grenoble) in 2005. His research interests include context-awareness, mobile computing, multimedia management and Semantic Web technologies.
Alina Dia Miron obtained her PhD degree in December 2009 from the School of Computer Science at the Joseph Fourier University of Grenoble, for her work on spatio-temporal ontologies for the Geospatial Semantic Web. She joined the Laboratory of Informatics of Grenoble (LIG, formerly the LSR-IMAG Laboratory) in 2005. Her research interests include spatial and temporal knowledge representation and reasoning techniques for the Semantic Web, ontology engineering, spatio-temporal semantic analysis, and the Geospatial Semantic Web.
Bogdan Moisuc has been an R&D engineer at the Grenoble Computer Science Laboratory since 2008. He received his Ph.D. in 2007 from the University Joseph Fourier in Grenoble, France, for his work on adaptive spatio-temporal information system design. His research interests include adaptability in mobile and Web GIS, object-based spatio-temporal knowledge representation, and modelling for spatio-temporal data and visualisations.
Jérôme Gensel has been a Full Professor at the University Pierre Mendès France of Grenoble, France, since 2007. He received his PhD in 1995 from the University of Grenoble for his work on Constraint Programming and Knowledge Representation in the Sherpa project at the French National Institute for Research in Computer Science and Control (INRIA). He joined the Laboratory of Informatics of Grenoble (LIG, formerly the LSR-IMAG Laboratory) in 2001. His research interests include Representation and Inference of Spatio-Temporal Information, Ontologies and Knowledge Representation, the Geographic Semantic Web, and Ubiquitous Geographical Information Systems.
Marlène Villanova-Oliver has been an Assistant Professor at the University Pierre Mendès France of Grenoble, France, since 2003. In 1999, she received her MS degree in Computer Science from the University Joseph Fourier of Grenoble and the European Diploma of 3rd cycle in Management and Technology of Information Systems (MATIS). She received her Ph.D. in 2002 from the National Polytechnic Institute of Grenoble (Grenoble INP). She has been a member of the Laboratory of Informatics of Grenoble (LIG, formerly the LSR-IMAG Laboratory) since 1998. Her research interests include Representation and Inference of Spatio-Temporal Information, Ontologies and Knowledge Representation, the Geographic Semantic Web, and adaptability to user and context in Web-based Information Systems.
Hervé Martin is a Professor at the University Joseph Fourier, Grenoble I, France, and also an associate professor at University Laval, Quebec, Canada. He received his PhD in Computer Science in 1991 and the "Habilitation à Diriger des Recherches" in 2000, both from the University of Grenoble I. He has published over 100 papers in the areas of database systems, multimedia information systems, and geomatics. He is a member of the editorial board of the international journal Multimedia Tools and Applications. His current research interests are in the area of spatial and multimedia information systems, and he heads a research group in this domain at the Laboratory of Informatics of Grenoble (LIG). Since January 2009, he has been president of the French Society for Education and Research in Computer Science (SPECIF).